From 198c141ba01a117e8f03a824a2768cef66ba72e4 Mon Sep 17 00:00:00 2001
From: "Jian Zhang (James)" <6593865@qq.com>
Date: Fri, 23 Jun 2023 16:11:57 -0700
Subject: [PATCH] [Feature] Add new docs for v0.1.1 release (#284)

*Issue #, if available:*

*Description of changes:*
This PR includes new docs for the v0.1.1 release.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

---------

Co-authored-by: Ubuntu
---
 docs/source/advanced/language-models.rst      | 184 ++++++++
 .../configuration-gconstruction.rst           |  37 +-
 .../configuration/configuration-partition.rst |  54 +--
 docs/source/index.rst                         |  10 +-
 docs/source/install/env-setup.rst             |  97 ++++-
 docs/source/scale/distributed.rst             |   2 +
 docs/source/scale/sagemaker.rst               | 397 +++++++++++++++++-
 docs/source/tutorials/own-data.rst            |  40 +-
 docs/source/tutorials/quick-start.rst         |  10 +-
 9 files changed, 753 insertions(+), 78 deletions(-)
 create mode 100644 docs/source/advanced/language-models.rst

diff --git a/docs/source/advanced/language-models.rst b/docs/source/advanced/language-models.rst
new file mode 100644
index 0000000000..fc6fba97c4
--- /dev/null
+++ b/docs/source/advanced/language-models.rst
@@ -0,0 +1,184 @@
.. _language_models:

Use Text as Node Features
=============================
Many real-world graphs have text as node features, e.g., the title and description of a product, or the questions and comments from users. To leverage this text, GraphStorm supports language models (LMs), i.e., HuggingFace BERT models, to embed the text and use the embeddings in graph model training and inference.

There are two modes of using LMs in GraphStorm:

* Embed text with pre-trained LMs and then use the embeddings as input node features, without fine-tuning the LMs. Training in this mode is fast and memory consumption is lower. However, in some cases pre-trained LMs may not fit the graph data well and fail to improve performance.

* Co-train both LMs and GML models in the same training loop. This fine-tunes the LMs to fit the graph data. In many cases this mode can improve performance, but co-training the LMs consumes much more memory, particularly GPU memory, and takes much longer to complete the training loops.

To use LMs in GraphStorm, users can follow the same procedure as the :ref:`Use Your Own Data` tutorial with some minor changes.

* Step 1. Prepare the raw data to include text as node data;
* Step 2. Use the GraphStorm graph construction tools to tokenize the text and set the tokens as node features;
* Step 3. Configure GraphStorm to use LMs to embed the tokenized text as input node features; and
* Step 4. If needed, configure GraphStorm to co-train the LM and GNN models.

.. Note::

    All commands below should be run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup` to prepare your environment.

    If you :ref:`set up the GraphStorm environment with pip Packages`, please replace all occurrences of "2222" in the argument ``--ssh-port`` with **22**, and clone the GraphStorm toolkits.

Prepare Raw Data
------------------
This tutorial uses the same ACM data as the :ref:`Use Your Own Data` tutorial to demonstrate how to prepare text as node features.

First go to the ``/graphstorm/examples/`` folder.

.. code-block:: bash

    cd /graphstorm/examples

Then run the command to create the ACM data with the required ``raw_w_text`` format.

.. code-block:: bash

    python3 /graphstorm/examples/acm_data.py --output-path /tmp/acm_raw --output-type raw_w_text

Once successful, the command will create a set of folders and files under the ``/tmp/acm_raw/`` folder, similar to the :ref:`outputs` in the :ref:`Use Your Own Data` tutorial. But the contents of the ``config.json`` file have a few extra lines that list the text feature columns and specify how they should be processed during graph construction.

The following snippet shows the information of ``author`` nodes. It indicates that the "**text**" column contains text features, and it requires GraphStorm's graph construction tool to use a `HuggingFace BERT model `_ named ``bert-base-uncased`` to tokenize these text features during construction.

.. code-block:: json

    "nodes": [
        {
            "node_type": "author",
            "format": {
                "name": "parquet"
            },
            "files": [
                "/tmp/acm_raw/nodes/author.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ]
        }

Construct Graph
------------------
Then we use the graph construction tool to process the ACM raw data with the following command for GraphStorm model training.

.. code-block:: bash

    python3 -m graphstorm.gconstruct.construct_graph \
              --conf-file /tmp/acm_raw/config.json \
              --output-dir /tmp/acm_nc \
              --num-parts 1 \
              --graph-name acm

Outcomes of this command are also the same as the :ref:`Outputs of Graph Construction`. But users may notice that the ``paper``, ``author``, and ``subject`` nodes all have three additional features, named ``input_ids``, ``attention_mask``, and ``token_type_ids``, which are generated by the BERT tokenizer.

GraphStorm Language Model Configuration
-----------------------------------------
Users can set up language models in GraphStorm's configuration YAML file. Below is an example of such a configuration for the ACM data. The full configuration YAML file, `acm_lm_nc.yaml `_, is located under GraphStorm's ``examples/use_your_own_data`` folder.

.. code-block:: yaml

    lm_model:
      node_lm_models:
        -
          lm_type: bert
          model_name: "bert-base-uncased"
          gradient_checkpoint: true
          node_types:
            - paper
            - author
            - subject

The current version of GraphStorm supports pre-trained BERT models from the HuggingFace repository on nodes only. Users can choose any of the `HuggingFace BERT models `_. But the value of ``model_name`` **MUST** be the same as the one specified in the raw data JSON file's ``bert_model`` field. Here in the example, it is the ``bert-base-uncased`` model.

The ``node_types`` field lists the types of nodes that have tokenized text features. In this ACM example, all three types of nodes have tokenized text features, and all of them are listed in the configuration YAML file.

Launch GraphStorm Training without Fine-tuning BERT Models
------------------------------------------------------------
With the above GraphStorm configuration YAML file, we can launch GraphStorm model training with the same commands as in :ref:`Step 3: Launch training script on your own graphs`.

First, we create the ``ip_list.txt`` file for the standalone mode.

.. code-block:: bash

    touch /tmp/ip_list.txt
    echo 127.0.0.1 > /tmp/ip_list.txt

Then, the launch command is almost the same, except that in this case the configuration file is ``acm_lm_nc.yaml``, which contains the language model configurations.

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat

In the training process, GraphStorm will first use the specified BERT model to compute the text embeddings of the specified node types. Then the text embeddings and other node features are concatenated together as the input node features for GNN model training.

Launch GraphStorm Training for both BERT and GNN Models
---------------------------------------------------------
To co-train BERT and GNN models, we need to add one more argument, ``--lm-train-nodes``, to either the launch command or the configuration YAML file. The command below adds this argument to the launch command.

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat \
            --lm-train-nodes 10

The ``--lm-train-nodes`` argument determines how many nodes will be used in each mini-batch per GPU to tune the BERT models. Because BERT models are normally large, training them consumes a lot of memory. Using all nodes to co-train the BERT and GNN models could cause GPU out-of-memory (OOM) errors. Using a smaller number for ``--lm-train-nodes`` can reduce the overall GPU memory consumption.

.. note:: It takes longer to co-train the BERT and GNN models than to train without co-training.

Only Use BERT Models
------------------------
GraphStorm also allows users to use only BERT models to perform graph tasks. We can add another argument, ``--lm-encoder-only``, to control whether to use only BERT models or not.

If users want to fine-tune the BERT model only, just add the ``--lm-train-nodes`` argument as in the command below:

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat \
            --lm-encoder-only \
            --lm-train-nodes 10

.. note:: The current version of GraphStorm requires that **ALL** node types have text features when users want to do the above graph-aware LM fine-tuning only.
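To make the tokenized node features above more concrete, here is a minimal sketch of what the ``tokenize_hf`` transform produces for one node's text. This is an illustration only, not GraphStorm code; it assumes the HuggingFace ``transformers`` package (already among GraphStorm's dependencies) and reuses the ``bert-base-uncased`` model and ``max_seq_length`` of 16 from the ACM example. The sample sentence is hypothetical.

.. code-block:: python

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # Tokenize one (hypothetical) node's text the way ``tokenize_hf`` is configured above.
    tokens = tokenizer(
        "An Efficient Design Methodology for Complex Systems",
        max_length=16,              # corresponds to ``max_seq_length``
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    # These three tensors correspond to the ``input_ids``, ``attention_mask``, and
    # ``token_type_ids`` node features mentioned in the Construct Graph section.
    print(tokens["input_ids"].shape)       # torch.Size([1, 16])
    print(tokens["attention_mask"].shape)  # torch.Size([1, 16])
    print(tokens["token_type_ids"].shape)  # torch.Size([1, 16])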
diff --git a/docs/source/configuration/configuration-gconstruction.rst b/docs/source/configuration/configuration-gconstruction.rst
index 401413dbd2..bbba736af8 100644
--- a/docs/source/configuration/configuration-gconstruction.rst
+++ b/docs/source/configuration/configuration-gconstruction.rst
@@ -6,19 +6,19 @@ Graph Construction
 `construct_graph.py `_ arguments
--------------------------------------------------------------------------------------------------------------------------------------
-* **--conf-file**: (**Required**) the path of the configuration JSON file.
-* **--num-processes**: the number of processes to process the data simulteneously. Default is 1. Increase this number can speed up data processing.
-* **--num-processes-for-nodes**: the number of processes to process node data simulteneously. Increase this number can speed up node data processing.
-* **--num-processes-for-edges**: the number of processes to process edge data simulteneously. Increase this number can speed up edge data processing.
-* **--output-dir**: (**Required**) the path of the output data files.
-* **--graph-name**: (**Required**) the name assigned for the graph.
-* **--remap-node_id**: boolean value to decide whether to rename node IDs or not. Default is true.
-* **--add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Default is true.
-* **--output-format**: the format of constructed graph, options are ``DGL`` and ``DistDGL``. Default is ``DistDGL``. The output format is explained in the :ref:`Output ` section below.
-* **--num-parts**: the number of partitions of the constructed graph. This is only valid if the output format is ``DistDGL``.
-* **--skip-nonexist-edges**: boolean value to decide whether skip edges whose endpoint nodes don't exist. Default is true.
-* **--ext-mem-workspace**: the directory where the tool can store data during graph construction. Suggest to use high-speed SSD as the external memory workspace.
-* **--ext-mem-feat-size**: the minimal number of feature dimensions that features can be stored in external memory. Default is 64.
+* **-\-conf-file**: (**Required**) the path of the configuration JSON file.
+* **-\-num-processes**: the number of processes used to process the data simultaneously. Default is 1. Increasing this number can speed up data processing.
+* **-\-num-processes-for-nodes**: the number of processes used to process node data simultaneously. Increasing this number can speed up node data processing.
+* **-\-num-processes-for-edges**: the number of processes used to process edge data simultaneously. Increasing this number can speed up edge data processing.
+* **-\-output-dir**: (**Required**) the path of the output data files.
+* **-\-graph-name**: (**Required**) the name assigned to the graph.
+* **-\-remap-node_id**: boolean value that decides whether to rename node IDs or not. Default is true.
+* **-\-add-reverse-edges**: boolean value that decides whether to add reverse edges to the given graph. Default is true.
+* **-\-output-format**: the format of the constructed graph; options are ``DGL`` and ``DistDGL``. Default is ``DistDGL``. The output format is explained in the :ref:`Output ` section below.
+* **-\-num-parts**: the number of partitions of the constructed graph. This is only valid if the output format is ``DistDGL``.
+* **-\-skip-nonexist-edges**: boolean value that decides whether to skip edges whose endpoint nodes don't exist. Default is true.
+* **-\-ext-mem-workspace**: the directory where the tool can store data during graph construction. We suggest using a high-speed SSD as the external memory workspace.
+* **-\-ext-mem-feat-size**: the minimal number of feature dimensions required for features to be stored in external memory. Default is 64.
 
 .. _gconstruction-json:
 
@@ -78,11 +78,11 @@ For JSON format, each line of the JSON file is a JSON object. The JSON object ca
 
 Feature transformation
 .........................
-Currently, the graph construction pipeline only supports three feature transformation:
+Currently, the graph construction pipeline supports the following feature transformations:
 
-* **HuggingFace tokenizer transformation** tokenizes text strings with a HuggingFace tokenizer. The ``name`` field in the feature transformation dictionary is ``tokenize_hf``. The dict should contain two additional fields. ``bert_model`` specifies the BERT model used for tokenization. ``max_seq_length`` specifies the maximal sequence length.
-* **HuggingFace BERT transformation** encodes text strings with a HuggingFace BERT model. The ``name`` field in the feature transformation dictionary is ``bert_hf``. The dict should contain two additional fields. ``bert_model`` specifies the BERT model used for tokenization. ``max_seq_length`` specifies the maximal sequence length.
-* **Numerical MAX_MIN transformation** normalizes numerical input features with $val = (val-min)/(max-min)$, where $val$ is the feature value, $max$ is the maximum number in the feature and $min$ is the minimum number in the feature. The ``name`` field in the feature transformation dictionary is ``max_min_norm``. The dict can contains two optional fields. ``max_bound`` specifies the maximum value allowed in the feature. Any number larger than ``max_bound`` will be set to ``max_bound``. ``min_bound`` specifies the minimum value allowed in the feature. Any number smaller than ``min_bound`` will be set to ``min_bound``.
+* **HuggingFace tokenizer transformation** tokenizes text strings with a HuggingFace tokenizer. The ``name`` field in the feature transformation dictionary is ``tokenize_hf``. The dict should contain two additional fields. ``bert_model`` specifies the BERT model used for tokenization. Users can choose any of the `HuggingFace BERT models `_. ``max_seq_length`` specifies the maximal sequence length.
+* **HuggingFace BERT transformation** encodes text strings with a HuggingFace BERT model. The ``name`` field in the feature transformation dictionary is ``bert_hf``. The dict should contain two additional fields. ``bert_model`` specifies the BERT model used for embedding the text. Users can choose any of the `HuggingFace BERT models `_. ``max_seq_length`` specifies the maximal sequence length.
+* **Numerical MAX_MIN transformation** normalizes numerical input features with `val = (val-min)/(max-min)`, where `val` is the feature value, `max` is the maximum number in the feature, and `min` is the minimum number in the feature. The ``name`` field in the feature transformation dictionary is ``max_min_norm``. The dict can contain two optional fields. ``max_bound`` specifies the maximum value allowed in the feature. Any number larger than ``max_bound`` will be set to ``max_bound``. ``min_bound`` specifies the minimum value allowed in the feature. Any number smaller than ``min_bound`` will be set to ``min_bound``.
 * **Numerical Rank Gauss transformation** normalizes numerical input features with rank gauss normalization. It maps the numeric feature values to a Gaussian distribution based on ranking. The method follows https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927. The ``name`` field in the feature transformation dictionary is ``rank_gauss``. The dict can contain one optional field, i.e., ``epsilon``, which is used to avoid INF float during computation.
 * **Convert to categorical values** converts text data to categorical values. The `name` field is `to_categorical`. `separator` specifies how to split the string into multiple categorical values (this is only used to define multiple categorical values). If `separator` is not specified, the entire string is a categorical value. `mapping` is a dict that specifies how to map a string to an integer value that defines a categorical value.
 
@@ -90,7 +90,8 @@ Currently, the graph construction pipeline only supports three feature transform
 Output
 ..........
-Currently, the graph construction pipeline outputs two output formats: DistDGL and DGL. If select ``DGL``, the output is a file, named `.dgl` under the folder specified by the **--output-dir** argument, where `` is the value of argument **--graph-name**. If select ``DistDGL``, the output is a JSON file, named `.json`, and a set of `part*` folders under the folder specified by the **--output-dir** argument, where the `*` is the number specified by the **--num-parts** argument.
+Currently, the graph construction pipeline outputs two formats: ``DistDGL`` and ``DGL``. If ``DGL`` is selected, the output is a file, named `<graph_name>.dgl`, under the folder specified by the **-\-output-dir** argument, where `<graph_name>` is the value of the argument **-\-graph-name**. If ``DistDGL`` is selected, the output is a JSON file, named `<graph_name>.json`, and a set of `part*` folders under the folder specified by the **-\-output-dir** argument, where `*` ranges over the number of partitions specified by the **-\-num-parts** argument.
+
 By specifying the output_format as ``DGL``, the output will be a `DGLGraph `_. By specifying the output_format as ``DistDGL``, the output will be a partitioned graph, i.e., a `DistDGL graph `_. It contains the partitioned graph, a JSON config describing the meta-information of the partitioned graph, and the mappings for the edges and nodes after partition, ``node_mapping.pt`` and ``edge_mapping.pt``, which map each node and edge in the partitioned graph into the original node and edge ID space. The node ID mapping is stored as a dictionary of 1D tensors whose key is the node type and value is a 1D tensor mapping between shuffled node IDs and the original node IDs. The edge ID mapping is stored as a dictionary of 1D tensors whose key is the edge type and value is a 1D tensor mapping between shuffled edge IDs and the original edge IDs.

.. note:: The two mapping files are used to record the mapping between the original node and edge IDs in the raw data files and the IDs of nodes and edges in the constructed graph. They are important for mapping the training and inference outputs. Therefore, DO NOT move or delete them.
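As an illustration of how these mapping files can be used, below is a minimal sketch. It is not a GraphStorm API; it assumes the mappings load with ``torch.load``, that position ``i`` of each tensor holds the original ID of shuffled node ``i`` per the description above, and the ``paper`` node type is only an example.

.. code-block:: python

    import torch

    # node_mapping.pt stores {node_type: 1D tensor}, as described above.
    node_mapping = torch.load("node_mapping.pt")

    # Assumption: entry i is the original (raw-data) ID of shuffled node i.
    paper_map = node_mapping["paper"]
    shuffled_ids = torch.tensor([0, 5, 42])   # node IDs in the constructed graph
    original_ids = paper_map[shuffled_ids]    # node IDs in the raw data files
    print(original_ids)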
diff --git a/docs/source/configuration/configuration-partition.rst b/docs/source/configuration/configuration-partition.rst
index bdc8edb33e..8949dc3452 100644
--- a/docs/source/configuration/configuration-partition.rst
+++ b/docs/source/configuration/configuration-partition.rst
@@ -11,41 +11,41 @@ For users who are already familiar with DGL and know how to construct DGL graph,
 `partition_graph.py `_ arguments
---------------------------------------------------------------------------------------------------------------
-- **--dataset**: (**Required**) the graph dataset name defined for the saved DGL graph file.
-- **--filepath**: (**Required**) the file path of the saved DGL graph file.
-- **--target-ntype**: the node type for making prediction, required for node classification/regression tasks. This argument is associated with the node type having labels. Current GraphStorm supports **one** predict node type only.
-- **--ntype-task**: the node type task to perform. Only support ``classification`` and ``regression`` so far. Default is ``classification``.
-- **--nlabel-field**: the field that stores labels on the predict node type, **required** if set the **target-ntype**. The format is ``nodetype:labelname``, e.g., `"paper:label"`.
-- **--target-etype**: the canonical edge type for making prediction, **required** for edge classification/regression tasks. This argument is associated with the edge type having labels. Current GraphStorm supports **one** predict edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`.
-- **--etype-task**: the edge type task to perform. Only allow ``classification`` and ``regression`` so far. Default is ``classification``.
-- **--elabel-field**: the field that stores labels on the predict edge type, required if set the **target-etype**. The format is ``src_ntype,etype,dst_ntype:labelname``, e.g., `"author,write,paper:label"`.
-- **--generate-new-node-split**: a boolean value, required if need the partition script to split nodes for training/validation/test sets. If set this argument ``true``, **must** set the **target-ntype** argument too.
-- **--generate-new-edge-split**: a boolean value, required if need the partition script to split edges for training/validation/test sets. If set this argument ``true``, you must set the **target-etype** argument too.
-- **--train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training.
-- **--val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation.
+- **-\-dataset**: (**Required**) the graph dataset name defined for the saved DGL graph file.
+- **-\-filepath**: (**Required**) the file path of the saved DGL graph file.
+- **-\-target-ntype**: the node type for making predictions, required for node classification/regression tasks. This argument is associated with the node type having labels. The current version of GraphStorm supports **one** prediction node type only.
+- **-\-ntype-task**: the node type task to perform. Only ``classification`` and ``regression`` are supported so far. Default is ``classification``.
+- **-\-nlabel-field**: the field that stores labels on the prediction node type, **required** if **target-ntype** is set. The format is ``nodetype:labelname``, e.g., `"paper:label"`.
+- **-\-target-etype**: the canonical edge type for making predictions, **required** for edge classification/regression tasks. This argument is associated with the edge type having labels. The current version of GraphStorm supports **one** prediction edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`.
+- **-\-etype-task**: the edge type task to perform. Only ``classification`` and ``regression`` are allowed so far. Default is ``classification``.
+- **-\-elabel-field**: the field that stores labels on the prediction edge type, required if **target-etype** is set. The format is ``src_ntype,etype,dst_ntype:labelname``, e.g., `"author,write,paper:label"`.
+- **-\-generate-new-node-split**: a boolean value, required if you need the partition script to split nodes into training/validation/test sets. If this argument is set to ``true``, you **must** set the **target-ntype** argument too.
+- **-\-generate-new-edge-split**: a boolean value, required if you need the partition script to split edges into training/validation/test sets. If this argument is set to ``true``, you must set the **target-etype** argument too.
+- **-\-train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training.
+- **-\-val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation.
 
 .. Note:: The sum of the **train-pct** and **val-pct** should be less than 1. And the percentage of test nodes/edges is the result of 1-(train_pct + val_pct).
 
-- **--add-reverse-edges**: if add this argument, will add reverse edges to the given graph.
-- **--retain-original-features**: boolean value to control if use the original features generated by dataset, e.g., embeddings of paper abstracts. If set to ``true``, will keep the original features; otherwise we will use the tokenized text for using BERT models to generate embeddings.
-- **--num-parts**: (**Required**) integer value that specifies partitions the DGL graph to be split. Remember this number because we will need to set it in the model training step.
-- **--output**: (**Required**) the folder path that the partitioned DGL graph will be saved.
+- **-\-add-reverse-edges**: if this argument is added, reverse edges will be added to the given graph.
+- **-\-retain-original-features**: boolean value that controls whether to use the original features generated by the dataset, e.g., embeddings of paper abstracts. If set to ``true``, the original features will be kept; otherwise the tokenized text will be used and BERT models will generate the embeddings.
+- **-\-num-parts**: (**Required**) integer value that specifies the number of partitions the DGL graph will be split into. Remember this number because we will need to set it in the model training step.
+- **-\-output**: (**Required**) the folder path where the partitioned DGL graph will be saved.
 
 `partition_graph_lp.py `_ arguments
------------------------------------------------------------------------------------------------------------------------------------
-- **--dataset**: (**Required**) the graph name defined for the saved DGL graph file.
-- **--filepath**: (**Required**) the file path of the saved DGL graph file.
-- **--target-etypes**: (**Required**) the canonical edge type for making prediction. GraphStorm supports **one** predict edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`.
-- **--train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training.
-- **--val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation.
+- **-\-dataset**: (**Required**) the graph name defined for the saved DGL graph file.
+- **-\-filepath**: (**Required**) the file path of the saved DGL graph file.
+- **-\-target-etypes**: (**Required**) the canonical edge type for making predictions. GraphStorm supports **one** prediction edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`.
+- **-\-train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training.
+- **-\-val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation.
 
 .. Note:: The sum of the **train-pct** and **val-pct** should be less than 1. And the percentage of test nodes/edges is the result of 1-(train_pct + val_pct).
 
-- **--add-reverse-edges**: if add this argument, will add reverse edges to the given graphs.
-- **--train-graph-only**: boolean value to control if partition the training graph or not, default is ``true``.
-- **--retain-original-features**: boolean value to control if use the original features generated by dataset, e.g., embeddings of paper abstracts. If set to ``true``, will keep the original features; otherwise we will use the tokenized text for using BERT models to generate embeddings.
-- **--retain-etypes**: the list of canonical edge type that will be retained before partitioning the graph. This might be helpful to remove noise edges in this application. Format example: ``—-retain-etypes query,clicks,asin query,adds,asin query,purchases,asin asin,rev-clicks,query``.
-- **--num-parts**: (**Required**) integer value that specifies partitions the DGL graph to be split. Remember this number because we will need to set it in the model training step.
-- **--output**: (**Required**) the folder path that the partitioned DGL graph will be saved.
\ No newline at end of file
+- **-\-add-reverse-edges**: if this argument is added, reverse edges will be added to the given graphs.
+- **-\-train-graph-only**: boolean value that controls whether to partition the training graph only; default is ``true``.
+- **-\-retain-original-features**: boolean value that controls whether to use the original features generated by the dataset, e.g., embeddings of paper abstracts. If set to ``true``, the original features will be kept; otherwise the tokenized text will be used and BERT models will generate the embeddings.
+- **-\-retain-etypes**: the list of canonical edge types that will be retained before partitioning the graph. This might be helpful to remove noise edges in the application. Format example: ``-\-retain-etypes query,clicks,asin query,adds,asin query,purchases,asin asin,rev-clicks,query``.
+- **-\-num-parts**: (**Required**) integer value that specifies the number of partitions the DGL graph will be split into. Remember this number because we will need to set it in the model training step.
+- **-\-output**: (**Required**) the folder path where the partitioned DGL graph will be saved.
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 703e46506e..e43519d9c7 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -20,6 +20,7 @@ Welcome to the GraphStorm Documentation and Tutorials
    :glob:
 
    scale/distributed
+   scale/sagemaker
 
 .. toctree::
    :maxdepth: 1
@@ -28,6 +29,7 @@ Welcome to the GraphStorm Documentation and Tutorials
    :glob:
 
    advanced/own-models
+   advanced/language-models
 
 .. toctree::
    :maxdepth: 2
@@ -54,13 +56,13 @@ Scale to Giant Graphs
 For experienced users who wish to train and run infernece on very large graphs,
 
 - follow the :ref:`Use GraphStorm in a Distributed Cluster` tutorial to use GraphStorm in the Distributed mode.
+- follow the :ref:`Use GraphStorm on SageMaker` tutorial to use GraphStorm in the Distributed mode based on Amazon SageMaker.
 
-Avanced Topics
+Advanced Topics
 --------------------
 
-For users who want to use their own GML models in GraphStorm,
-
-- follow the :ref:`Use Your Own GNN Models` tutorial to learn the programming interfaces and the steps of how to modify users' own models.
+- For users who want to use their own GML models in GraphStorm, follow the :ref:`Use Your Own GNN Models` tutorial to learn the programming interfaces and the steps for modifying their own models.
+- For users who want to use text as node features, follow the :ref:`Use Text as Node Features` tutorial to learn how to leverage BERT models to use text as node features in GraphStorm.
 
 Contribution
 -------------
diff --git a/docs/source/install/env-setup.rst b/docs/source/install/env-setup.rst
index 4a5815fff0..da3dfad6b5 100644
--- a/docs/source/install/env-setup.rst
+++ b/docs/source/install/env-setup.rst
@@ -2,13 +2,12 @@
 Environment Setup
 ======================
+GraphStorm can be installed as a pip package. However, configuring a GraphStorm environment on various operating systems is non-trivial; therefore, GraphStorm provides a Docker-based running environment for easy deployment.
 
-For a quick and easy setup, GraphStorm can be installed as a pip package.
-
-However, configuring an GraphStorm environment is non-trivial. Users need to install dependencies and configure distributed PyTorch running environments. For this reason, we recommend that our users setup Docker as the base running environment to use GraphStorm.
-
+1. Setup GraphStorm Docker Environment
+---------------------------------------
 Prerequisites
------------------
+...............
 
 1. **Docker**: You need to install Docker in your environment as the `Docker documentation `_ suggests, and the `Nvidia Container Toolkit `_.
 
@@ -22,12 +21,12 @@ For example, in an AWS EC2 instance without Docker preinstalled, you can run the
 
 If using AWS `Deep Learning AMI GPU version`, the Nvidia Container Toolkit has been preinstalled.
 
-2. **GPU**: The current version of GraphStorm requires **at least one GPU** installed in the instance.
+2. **GPU**: The current version of GraphStorm requires **at least one Nvidia GPU** installed in the instance.
 
 .. _build_docker:
 
 Build a GraphStorm Docker image from source code
---------------------------------------------------
+.................................................
 
 Please use the following command to build a Docker image from source:
 
@@ -41,7 +40,7 @@ Please use the following command to build a Docker image from source:
 
 There are three arguments of the ``build_docker_oss4local.sh``:
 
-1. **path-to-graphstorm** (**required**), is the absolute path of the "graphstorm" folder, where you clone and download the GraphStorm source code. For example, the path could be ``/code/graphstorm``.
+1. **path-to-graphstorm** (**required**), is the absolute path of the "graphstorm" folder, where you cloned the GraphStorm source code. For example, the path could be ``/code/graphstorm``.
 2. **docker-name** (optional), is the assigned name of the to be built Docker image. Default is ``graphstorm``.
 3. **docker-tag** (optional), is the assigned tag name of the to be built docker image. Default is ``local``.
 
@@ -53,9 +52,8 @@ You can use the below command to check if the new Docker image is created succes
 
 If the build succeeds, there should be a new Docker image, named *:*, e.g., ``graphstorm:local``.
 
-
 Create a GraphStorm Container
--------------------------------
+..............................
 
 First, you need to create a GraphStorm container based on the Docker image built in the previous step.
 
Run the following command:
 
.. code:: bash
 
-   nvidia-docker run --network=host -v /dev/shm:/dev/shm/ -d --name test graphstomr:local
+   nvidia-docker run --network=host -v /dev/shm:/dev/shm/ -d --name test graphstorm:local
 
-This command will create a GraphStorm contained, named ``test`` and run the container as a daemon.
+This command will create a GraphStorm container, named ``test``, and run the container as a daemon.
 
Then connect to the container by running the following command:
 
@@ -78,3 +76,78 @@ If succeeds, the command prompt will change to the container's, like
 
 .. code-block:: console
 
     root@ip-address:/#
+
+.. _setup_pip:
+
+2. Setup GraphStorm with pip Packages
+--------------------------------------
+Prerequisites
+...............
+
+1. **Linux OS**: The current version of GraphStorm supports Linux as the operating system. We tested GraphStorm on both Ubuntu (22.04 or later) and Amazon Linux 2.
+
+2. **GPU**: The current version of GraphStorm requires **at least one Nvidia GPU** installed in the instance.
+
+3. **Python3**: The current version of GraphStorm requires a Python version newer than **3.7**.
+
+Install GraphStorm
+...................
+Users can use ``pip`` or ``pip3`` to install GraphStorm.
+
+.. code-block:: bash
+
+    pip install graphstorm
+
+Install Dependencies
+.....................
+GraphStorm requires a set of dependencies, which can be installed with the following ``pip`` or ``pip3`` commands.
+
+.. code-block:: bash
+
+    pip install boto3==1.26.126
+    pip install botocore==1.29.126
+    pip install h5py==3.8.0
+    pip install scipy
+    pip install tqdm==4.65.0
+    pip install pyarrow==12.0.0
+    pip install transformers==4.28.1
+    pip install pandas
+    pip install scikit-learn
+    pip install ogb==1.3.6
+    pip install psutil==5.9.5
+    pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+    pip install dgl==1.0.3+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
+
+Configure SSH No-password login
+................................
+Use the following commands to configure a local SSH no-password login that GraphStorm relies on.
+
+.. code-block:: bash
+
+    ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
+    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
+
+Then use this command to test if the SSH no-password login works.
+
+.. code-block:: bash
+
+    ssh 127.0.0.1
+
+If everything is set up correctly, the above command will enter another Linux shell process. Then exit this new shell with the command ``exit``.
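+
+As an optional sanity check of the pip-based setup (a quick sketch, not an official GraphStorm tool; the ``__version__`` attribute is an assumption, hence the ``getattr`` guard), users can verify that GraphStorm imports and that a GPU is visible:
+
+.. code-block:: python
+
+    import graphstorm
+    import torch
+
+    # The pip-based setup requires at least one Nvidia GPU (see Prerequisites).
+    assert torch.cuda.is_available(), "No GPU detected; check the CUDA setup."
+    # __version__ may not be defined in all releases, hence the getattr guard.
+    print("GraphStorm version:", getattr(graphstorm, "__version__", "unknown"))
+    print("Visible GPUs:", torch.cuda.device_count())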
+
+Clone GraphStorm Toolkits (Optional)
+..........................................
+GraphStorm provides a set of toolkits, including scripts, tools, and examples, which can facilitate the use of GraphStorm.
+
+* **graphstorm/training_scripts/** and **graphstorm/inference_scripts/** include example configuration YAML files that are used in the GraphStorm documentation and tutorials.
+* **graphstorm/examples** includes Python code for customized models and customized data preparation.
+* **graphstorm/tools** includes graph partition and related Python code.
+* **graphstorm/sagemaker** includes commands and code to run GraphStorm on Amazon SageMaker.
+
+Users can clone the GraphStorm source code to obtain these toolkits.
+
+.. code-block:: bash
+
+    git clone https://github.com/awslabs/graphstorm.git
+
+.. warning:: If you use this method to set up the GraphStorm environment, please replace the value of the ``--ssh-port`` argument in the launch commands of GraphStorm's tutorials from 2222 to **22**.
\ No newline at end of file
diff --git a/docs/source/scale/distributed.rst b/docs/source/scale/distributed.rst
index 388b255c88..2b89821bae 100644
--- a/docs/source/scale/distributed.rst
+++ b/docs/source/scale/distributed.rst
@@ -78,6 +78,8 @@ If not, please make sure there is no limitation of port 2222.
 
 For distributed training, users also need to make sure ports under 65536 are open for DistDGL to use.
 
+.. _partition-a-graph:
+
 Partition a Graph
 -------------------------------
 
diff --git a/docs/source/scale/sagemaker.rst b/docs/source/scale/sagemaker.rst
index 50e40a2f0b..16bc39fe11 100644
--- a/docs/source/scale/sagemaker.rst
+++ b/docs/source/scale/sagemaker.rst
@@ -1,17 +1,398 @@
 .. _distributed-sagemaker:
 
-[**Under construction**]
+Use GraphStorm on SageMaker
+===================================
+GraphStorm can run on Amazon SageMaker to leverage SageMaker's ML DevOps capabilities.
 
-Use GraphStorm in SageMaker
-============================
-
-Setup SageMaker
+Prerequisites
 -----------------
+In order to use GraphStorm on Amazon SageMaker, users need to have access to the following AWS services.
+
+- **SageMaker service**. Please refer to `Amazon SageMaker service `_ for how to get access to Amazon SageMaker.
+- **Amazon ECR**. Please refer to `Amazon Elastic Container Registry service `_ for how to get access to Amazon ECR.
+- **S3 service**. Please refer to `Amazon S3 service `_ for how to get access to Amazon S3.
+- **SageMaker Framework Containers**. Please follow the `AWS Deep Learning Containers guideline `_ to get access to the image.
+- **Amazon EC2** (optional). Please refer to `Amazon EC2 service `_ for how to get access to Amazon EC2.
 
-Launch Training
------------------
+Setup GraphStorm SageMaker Docker Image
+----------------------------------------------
+GraphStorm uses SageMaker's **BYOC** (Bring Your Own Container) mode. Therefore, before launching GraphStorm on SageMaker, there are two steps required to set up a GraphStorm SageMaker Docker image.
+
+.. _build_sagemaker_docker:
+
+Step 1: Build a SageMaker-compatible Docker image
+...................................................
+
+.. note::
+    * Please make sure your account has an access key (AK) and a secret access key (SK) configured to authenticate access to AWS services.
+    * For more details of Amazon ECR operations via the CLI, users can refer to the `Using Amazon ECR with the AWS CLI document `_.
+
+First, on a Linux machine, configure a Docker environment by following the `Docker documentation `_ suggestions.
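+
+The note above assumes AK/SK credentials are already configured. As a quick, optional check (an illustration only, not part of GraphStorm), the following snippet uses ``boto3``, which is already among GraphStorm's dependencies, to print the identity that AWS SDK and CLI calls will use:
+
+.. code-block:: python
+
+    import boto3
+
+    # Fails with a credentials error if no valid AK/SK is configured;
+    # otherwise prints the account and IAM identity used for AWS calls.
+    identity = boto3.client("sts").get_caller_identity()
+    print("Account:", identity["Account"])
+    print("Caller ARN:", identity["Arn"])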
+
+In order to use the SageMaker base Docker image, users need to run the following command to authenticate so the SageMaker images can be pulled.
+
+.. code-block:: bash
+
+    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
+
+Then, clone the GraphStorm source code and build a GraphStorm SageMaker-compatible Docker image from source with the commands:
+
+.. code-block:: bash
+
+    git clone https://github.com/awslabs/graphstorm.git
+
+    cd /path-to-graphstorm/docker/
+
+    bash /path-to-graphstorm/docker/build_docker_sagemaker.sh /path-to-graphstorm/
+
+The ``build_docker_sagemaker.sh`` script takes three arguments:
+
+1. **path-to-graphstorm** (**required**), is the absolute path of the ``graphstorm`` folder, where you cloned the GraphStorm source code. For example, the path could be ``/code/graphstorm``.
+2. **DOCKER_NAME** (optional), is the assigned name of the to-be-built Docker image. Default is ``graphstorm``.
+
+.. warning::
+    In order to upload the GraphStorm SageMaker Docker image to Amazon ECR, users need to define the ``DOCKER_NAME`` to include the ECR URI string, **<account_id>.dkr.ecr.<region>.amazonaws.com/<name>**, e.g., ``888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm``.
+
+3. **DOCKER_TAG** (optional), is the assigned tag name of the to-be-built Docker image. Default is ``sm``.
+
+Once the ``build_docker_sagemaker.sh`` command completes successfully, there will be a Docker image, named ``<DOCKER_NAME>:<DOCKER_TAG>``, such as ``888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm``, in the local repository, which can be listed by running:
+
+.. code-block:: bash
+
+    docker image ls
+
+.. _upload_sagemaker_docker:
+
+Step 2: Upload Docker Images to Amazon ECR Repository
+.......................................................
+Because SageMaker relies on Amazon ECR to access customers' own Docker images, users need to upload the Docker images built in Step 1 to their own ECR repository.
+
+The following command authenticates the user account to the user's own ECR repository via the AWS CLI.
+
+.. code-block:: bash
+
+    aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com
+
+Please replace ``<region>`` and ``<account_id>`` with your own account information, consistent with the values used in **Step 1**.
+
+In addition, users need to create an ECR repository in the specified ``<region>`` whose name is the ``<DOCKER_NAME>`` **WITHOUT** the ECR URI string, e.g., ``graphstorm``.
+
+And then use the command below to push the built GraphStorm Docker image to the user's own ECR repository.
+
+.. code-block:: bash
+
+    docker push <DOCKER_NAME>:<DOCKER_TAG>
+
+Please replace ``<DOCKER_NAME>`` and ``<DOCKER_TAG>`` with the actual Docker image name and tag, e.g., ``888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm``.
+
+Run GraphStorm on SageMaker
+----------------------------
+There are two ways to run GraphStorm on SageMaker.
+
+* **Run with the Amazon SageMaker service**. In this way, users use GraphStorm's tools to submit SageMaker API calls, which request that the SageMaker service start new SageMaker training or inference instances that run GraphStorm code. Users can submit the API calls from a properly configured machine without GPUs (e.g., a C5.xlarge instance). This is the formal way to run GraphStorm experiments on large graphs and to deploy GraphStorm on SageMaker for production environments.
+* **Run with Docker Compose in a local environment**. In this way, users do not call the SageMaker service, but use Docker Compose to run the SageMaker execution environment locally on a Linux instance that has GPUs. This is mainly for model developers and testers to simulate running GraphStorm on SageMaker.
+
+Run GraphStorm with Amazon SageMaker service
+..............................................
+To run GraphStorm with the Amazon SageMaker service, users should set up an instance with the SageMaker library installed and GraphStorm's SageMaker tools copied.
+
+1. Use the below command to install SageMaker.
+
+.. code-block:: bash
+
+    pip install sagemaker
+
+2. Copy GraphStorm's SageMaker tools. Users can clone the GraphStorm repository using the following command or copy the `sagemaker folder `_ to the instance.
+
+.. code-block:: bash
+
+    git clone https://github.com/awslabs/graphstorm.git
+
+Prepare graph data
+`````````````````````
+Unlike GraphStorm's :ref:`Standalone mode` and :ref:`the Distributed mode`, which rely on a local disk or a shared file system to store the partitioned graph, SageMaker utilizes Amazon S3 as the shared data storage for distributing the partitioned graph and the configuration YAML file.
+
+This tutorial uses the same three-partition OGB-MAG graph and the Link Prediction task as those introduced in the :ref:`Partition a Graph` section of the :ref:`Use GraphStorm in a Distributed Cluster` tutorial. After generating the partitioned OGB-MAG graph, use the following commands to upload it and the configuration YAML file to an S3 bucket.
+
+.. code-block:: bash
+
+    aws s3 cp --recursive /data/ogbn_mag_lp_3p s3://<PATH_TO_DATA>/ogbn_mag_lp_3p
+    aws s3 cp /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml
+
+Please replace ``<PATH_TO_DATA>`` and ``<PATH_TO_TRAINING_CONFIG>`` with your own S3 bucket URIs.
+
+Launch training
+```````````````````
+Launching GraphStorm training on SageMaker is similar to launching it in the :ref:`Standalone mode` and :ref:`the Distributed mode`, except for three differences:
+
+* The launch commands are located in the ``graphstorm/sagemaker`` folder;
+* Users need to provide AWS service-related information in the command; and
+* All paths for saving models, embeddings, and prediction results should be specified as S3 locations using the S3-related arguments.
+
+Users can use the following commands to launch a GraphStorm Link Prediction training job with the OGB-MAG graph.
+
+.. code-block:: bash
+
+    cd /path-to-graphstorm/sagemaker/
+
+    python3 launch/launch_train.py \
+            --image-url <AMAZON_ECR_IMAGE_URI> \
+            --region <REGION> \
+            --entry-point run/train_entry.py \
+            --role <ROLE_ARN> \
+            --instance-count 3 \
+            --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
+            --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
+            --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
+            --graph-name ogbn-mag \
+            --task-type link_prediction \
+            --lp-decoder-type dot_product \
+            --num-layers 1 \
+            --fanout 10 \
+            --hidden-size 128 \
+            --backend gloo \
+            --batch-size 128
+
+Please replace ``<AMAZON_ECR_IMAGE_URI>`` with the ``<DOCKER_NAME>:<DOCKER_TAG>`` image uploaded in Step 2, e.g., ``888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm``; replace ``<REGION>`` with the region where the ECR image repository is located, e.g., ``us-east-1``; and replace ``<ROLE_ARN>`` with your AWS ARN that has the SageMaker execution role, e.g., ``"arn:aws:iam::<account_id>:role/service-role/AmazonSageMaker-ExecutionRole-20220627T143571"``.
+
+Because we are using a three-partition OGB-MAG graph, we need to set the ``--instance-count`` to 3 in this command.
+
+The trained model artifacts will be stored in the S3 location provided through the ``--model-artifact-s3`` argument. You can use the following command to check the model artifacts after the training completes.
+
+.. code-block:: bash
+
+    aws s3 ls s3://<PATH_TO_SAVE_TRAINED_MODEL>/
 
 Launch inference
------------------
+`````````````````````
+Users can use the following command to launch a GraphStorm Link Prediction inference job on the OGB-MAG graph.
+
+.. code-block:: bash
+
+    python3 launch/launch_infer.py \
+            --image-url <AMAZON_ECR_IMAGE_URI> \
+            --region <REGION> \
+            --entry-point run/infer_entry.py \
+            --role <ROLE_ARN> \
+            --instance-count 3 \
+            --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
+            --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
+            --model-artifact-s3 s3://<PATH_TO_SAVED_MODEL> \
+            --output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
+            --output-prediction-s3 s3://<PATH_TO_SAVE_PREDICTION_RESULTS> \
+            --graph-name ogbn-mag \
+            --task-type link_prediction \
+            --num-layers 1 \
+            --fanout 10 \
+            --hidden-size 128 \
+            --backend gloo \
+            --batch-size 128
+
+.. note::
+
+    Different from the training command, in the inference command the value of the ``--model-artifact-s3`` argument needs to be the path to a saved model. By default, it is stored under an S3 path with a specific training epoch or epoch-plus-iteration number, e.g., ``s3://models/epoch-0-iter-999``, where the trained model artifacts were saved.
+
+As the outcome of the inference command, the generated node embeddings will be uploaded to ``s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/``. For node classification/regression or edge classification/regression tasks, users can use ``--output-prediction-s3`` to specify the save locations of the prediction results.
+
+Users can use the following commands to check the corresponding outputs:
+
+.. code-block:: bash
+
+    aws s3 ls s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/
+    aws s3 ls s3://<PATH_TO_SAVE_PREDICTION_RESULTS>/
+
+Run GraphStorm SageMaker with Docker Compose
+..............................................
+This section describes how to launch Docker Compose jobs that emulate a SageMaker training execution environment. This can be used to develop and test GraphStorm model training and inference on SageMaker locally.
+
+If users have never worked with Docker Compose before, the official description provides a great intro:
+
+.. hint::
+
+    Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application's services. Then, with a single command, you create and start all the services from your configuration.
+
+We will use this capability to launch multiple worker instances locally, which will be configured to "look like" a SageMaker training instance and communicate over a virtual network created by Docker Compose. This way our test environment will be as close to a real SageMaker distributed job as we can get, without needing to launch SageMaker jobs, or launch and configure multiple EC2 instances when developing features.
+
+Get Started
+`````````````
+To run GraphStorm SageMaker with Docker Compose, we need to set up a local Linux instance with the following steps.
+
+1. Use the below command to install SageMaker.
+
+.. code-block:: bash
+
+    pip install sagemaker
+
+2. Clone GraphStorm and install its dependencies.
+
+.. code-block:: bash
+
+    git clone https://github.com/awslabs/graphstorm.git
+
+    pip install boto3==1.26.126
+    pip install botocore==1.29.126
+    pip install h5py==3.8.0
+    pip install scipy
+    pip install tqdm==4.65.0
+    pip install pyarrow==12.0.0
+    pip install transformers==4.28.1
+    pip install pandas
+    pip install scikit-learn
+    pip install ogb==1.3.6
+    pip install psutil==5.9.5
+    pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+    pip install dgl==1.0.3+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
+
+3. Set up GraphStorm in the PYTHONPATH variable.
+
+.. code-block:: bash
+
+    export PYTHONPATH=/PATH_TO_GRAPHSTORM/python:$PYTHONPATH
+
+4. Build a SageMaker-compatible Docker image following :ref:`Step 1 `.
+
+5. Follow the `Docker Compose `_ documentation to install Docker Compose.
+
+Generate a Docker Compose file
+`````````````````````````````````
+A Docker Compose file is a YAML file that tells Docker which containers to spin up and how to configure them. To launch the services defined in a Docker Compose file, we can use ``docker compose -f docker-compose.yaml up``. This will launch the containers and execute their entry points.
+
+To emulate a SageMaker distributed execution environment based on the previously built Docker image (suppose the Docker image is named ``graphstorm:sm``), you would need a Docker Compose file that resembles the following:
+
+.. code-block:: yaml
+
+    version: '3.7'
+
+    networks:
+      gsf:
+        name: gsf-network
+
+    services:
+      algo-1:
+        image: graphstorm:sm
+        container_name: algo-1
+        hostname: algo-1
+        networks:
+          - gsf
+        command: 'xxx'
+        environment:
+          SM_TRAINING_ENV: '{"hosts": ["algo-1", "algo-2", "algo-3", "algo-4"], "current_host": "algo-1"}'
+          WORLD_SIZE: 4
+          MASTER_ADDR: 'algo-1'
+          AWS_REGION: 'us-west-2'
+        ports:
+          - 22
+        working_dir: '/opt/ml/code/'
+
+      algo-2:
+          [...]
+
+Some explanation of the above elements (see the `official docs `_ for more details):
+
+* **image**: Specifies the Docker image that will be used for launching the container. In this case, the image is ``graphstorm:sm``, which should correspond to the previously built Docker image.
+* **environment**: Sets the environment variables for the container.
+* **command**: Specifies the entry point, i.e., the command that will be executed when the container launches. In this case, the placeholder ``'xxx'`` stands for the actual entry point command, e.g., ``/path/to/entrypoint.sh``.
+
+To help users generate the Compose YAML file automatically, GraphStorm provides a Python script, ``generate_sagemaker_docker_compose.py``, that builds the Docker Compose file for users.
+
+.. Note:: The script uses the `PyYAML `_ library. Please use the below command to install it.
+
+    .. code-block:: bash
+
+        pip install pyyaml
+
+This Python script has 4 required arguments that determine the Docker Compose file that will be generated:
+
+* **--aws-access-key-id**: The AWS access key ID for accessing S3 data within the Docker containers.
+* **--aws-secret-access-key**: The AWS secret access key for accessing S3 data within the Docker containers.
+* **--aws-session-token**: The AWS session token used for accessing S3 data within the Docker containers.
+* **--num-instances**: The number of instances we want to launch. This determines the number of ``algo-x`` service entries the Compose file ends up with.
+
+The rest of the arguments are passed on to ``sagemaker_train.py`` or ``sagemaker_infer.py``:
+
+* **--task-type**: Task type.
+* **--graph-data-s3**: S3 location of the input graph.
+* **--graph-name**: Name of the input graph.
+* **--yaml-s3**: S3 location of the YAML file for training and inference.
+* **--custom-script**: Custom training script provided by customers to run custom training logic. This should be a path to the Python script within the Docker image.
+* **--output-emb-s3**: S3 location to store the GraphStorm-generated node embeddings. This is an inference-only argument.
+* **--output-prediction-s3**: S3 location to store prediction results. This is an inference-only argument.
+
+Run GraphStorm on Docker Compose for Training
+```````````````````````````````````````````````
+First, use the following command to generate a Compose YAML file for the Link Prediction training on the OGB-MAG graph.
+
+.. code-block:: bash
+
+    python3 generate_sagemaker_docker_compose.py \
+            --aws-access-key-id <AWS_ACCESS_KEY_ID> \
+            --aws-secret-access-key <AWS_SECRET_ACCESS_KEY> \
+            --aws-session-token <AWS_SESSION_TOKEN> \
+            --num-instances 3 \
+            --image <AMAZON_ECR_IMAGE_URI> \
+            --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
+            --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
+            --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL> \
+            --graph-name ogbn-mag \
+            --task-type link_prediction \
+            --num-layers 1 \
+            --fanout 10 \
+            --hidden-size 128 \
+            --backend gloo \
+            --batch-size 128
+
+The above command will create a Docker Compose file named ``docker-compose-<task-type>-<num-instances>-train.yaml``, which we can then use to launch the job.
+
+As our Docker Compose setup uses a Docker network, named ``gsf-network``, for inter-container communication, users need to run the following command to create the network before launching Docker Compose.
+
+.. code-block:: bash
+
+    docker network create "gsf-network"
+
+Then, use the following command to run the Link Prediction training on the OGB-MAG graph.
+
+.. code-block:: bash
+
+    docker compose -f docker-compose-link_prediction-3-train.yaml up
+
+Running the above command will launch 3 containers from the image, configured with the command and environment variables that emulate a SageMaker execution environment, and run the ``sagemaker_train.py`` script.
+
+.. Note:: The containers actually interact with S3, so the provided AWS access key, secret access key, and session token should be valid for accessing the S3 bucket.
+
+Run GraphStorm on Docker Compose for Inference
+```````````````````````````````````````````````
+The ``generate_sagemaker_docker_compose.py`` script can build a Compose file for the inference task with the same arguments as for training, plus one new argument, ``--inference``. The command below creates the Compose file for the Link Prediction inference on the OGB-MAG graph.
+
+.. code-block:: bash
+
+    python3 generate_sagemaker_docker_compose.py \
+            --aws-access-key-id <AWS_ACCESS_KEY_ID> \
+            --aws-secret-access-key <AWS_SECRET_ACCESS_KEY> \
+            --aws-session-token <AWS_SESSION_TOKEN> \
+            --num-instances 3 \
+            --image <AMAZON_ECR_IMAGE_URI> \
+            --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
+            --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
+            --model-artifact-s3 s3://<PATH_TO_SAVED_MODEL> \
+            --graph-name ogbn-mag \
+            --task-type link_prediction \
+            --num-layers 1 \
+            --fanout 10 \
+            --hidden-size 128 \
+            --backend gloo \
+            --batch-size 128 \
+            --inference
+
+The command will create a Docker Compose file named ``docker-compose-<task-type>-<num-instances>-infer.yaml``. We can then use the same command to spin up the inference job.
+
+.. code-block:: bash
+
+    docker compose -f docker-compose-link_prediction-3-infer.yaml up
+
+Clean Up
+``````````````````
+To save computing resources, users can run the below command to clean up the Docker Compose environment.
+
+.. code-block:: bash
+
+    docker compose -f <docker-compose-file> down
diff --git a/docs/source/tutorials/own-data.rst b/docs/source/tutorials/own-data.rst
index 4c1f7f42fb..e46c39a00e 100644
--- a/docs/source/tutorials/own-data.rst
+++ b/docs/source/tutorials/own-data.rst
@@ -12,6 +12,8 @@ It is easy for users to prepare their own graph data and leverage GraphStorm's b
 
     All commands below should be run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup` to prepare your environment.
 
+    If you :ref:`set up the GraphStorm environment with pip Packages`, please replace all occurrences of "2222" in the argument ``--ssh-port`` with **22**, and clone the GraphStorm toolkits.
diff --git a/docs/source/tutorials/own-data.rst b/docs/source/tutorials/own-data.rst
index 4c1f7f42fb..e46c39a00e 100644
--- a/docs/source/tutorials/own-data.rst
+++ b/docs/source/tutorials/own-data.rst
@@ -12,6 +12,8 @@ It is easy for users to prepare their own graph data and leverage GraphStorm's b
 
     All commands below should be run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup` to prepare your environment.
 
+    If you :ref:`set up the GraphStorm environment with pip Packages`, please replace all occurrences of "2222" in the argument ``--ssh-port`` with **22**, and clone the GraphStorm toolkits.
+
 Step 1: Prepare Your Own Graph Data
 -------------------------------------
 There are two options to prepare your own graph data for using GraphStorm:
@@ -47,6 +49,8 @@ Then run the command to create the ACM data with the required raw format.
 
 Once succeeded, the command will create a set of folders and files under the ``/tmp/acm_raw/`` folder, as shown below:
 
+.. _acm-raw-data-output:
+
 .. code-block:: bash
 
     /tmp/acm_raw
@@ -232,6 +236,34 @@ Based on the original ACM dataset, this example builds a simple heterogenous gra
 .. figure:: ../../../tutorial/ACM_schema.png
     :align: center
 
+Customized label split
+`````````````````````````
+If users want to split labels with their own logic, e.g., by time sequence, they can split the labels first and then provide the split information in the configuration JSON file, as in the example below.
+
+.. code-block:: json
+
+    "labels": [
+        {
+            "label_col": "label",
+            "task_type": "classification",
+            "custom_split_filenames": {"train": "/tmp/acm_raw/nodes/train_idx.json",
+                                       "valid": "/tmp/acm_raw/nodes/val_idx.json",
+                                       "test": "/tmp/acm_raw/nodes/test_idx.json"}
+        }
+    ]
+
+Instead of using ``split_pct``, users can specify the ``custom_split_filenames`` configuration, whose value is a dictionary. The dictionary's keys may include ``train``, ``valid``, and ``test``, and its values are paths to JSON files that contain the node/edge IDs of each set.
+
+Each JSON file only needs to list the IDs in its own set. For example, suppose a node classification task has 100 nodes whose IDs start from 0, and the last 50 nodes (IDs from 50 to 99) have labels associated. Following some business logic, users want the first 10 of the 50 labeled nodes as the training set, the last 30 as the test set, and the middle 10 as the validation set. Then the ``train_idx.json`` file should contain the integers from 50 to 59, one integer per line. Similarly, the ``val_idx.json`` file should contain the integers from 60 to 69, and the ``test_idx.json`` file should contain the integers from 70 to 99. Contents of the ``train_idx.json`` file look like the following (see the short Python sketch after this file's changes for one way to generate these files).
+
+.. code-block:: json
+
+    50
+    51
+    52
+    ...
+    59
+
 .. _raw-data-files:
 
 Input raw node/edge data files
@@ -406,10 +438,6 @@ Below is an example YAML configuration file for the ACM data, which sets to use
       sparse_optimizer_lr: 1e-2
       use_node_embeddings: false
     node_classification:
-      node_feat_name:
-        - "paper:feat"
-        - "author:feat"
-        - "subject:feat"
       target_ntype: "paper"
       label_field: "label"
       multilabel: false
@@ -417,6 +445,8 @@ Below is an example YAML configuration file for the ACM data, which sets to use
 
 You can copy this file to the ``/tmp`` folder within the GraphStorm container for the next step.
 
+.. _launch_training_oyog:
+
 Step 3: Launch training script on your own graphs
 ---------------------------------------------------
@@ -437,7 +467,7 @@ Below is a launch script example that trains a GraphStorm built-in RGCN model on
         --workspace /tmp \
         --part-config /tmp/acm_nc/acm.json \
         --ip-config /tmp/ip_list.txt \
-        --num-trainers 4 \
+        --num-trainers 1 \
         --num-servers 1 \
         --num-samplers 0 \
         --ssh-port 2222 \
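To make the customized label split in ``own-data.rst`` concrete, here is a minimal Python sketch that writes the three index files from the example above. The paths and ID ranges come directly from that example; the rest is illustrative glue code, not part of GraphStorm.

.. code-block:: python

    import os

    # Write the custom split files from the example: node IDs 50-59 for
    # training, 60-69 for validation, and 70-99 for testing, one ID per line.
    splits = {
        "/tmp/acm_raw/nodes/train_idx.json": range(50, 60),
        "/tmp/acm_raw/nodes/val_idx.json": range(60, 70),
        "/tmp/acm_raw/nodes/test_idx.json": range(70, 100),
    }

    os.makedirs("/tmp/acm_raw/nodes", exist_ok=True)
    for path, ids in splits.items():
        with open(path, "w") as f:
            f.write("\n".join(str(i) for i in ids) + "\n")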
diff --git a/docs/source/tutorials/quick-start.rst b/docs/source/tutorials/quick-start.rst
index d6c0bf01b2..14a66ea7ad 100644
--- a/docs/source/tutorials/quick-start.rst
+++ b/docs/source/tutorials/quick-start.rst
@@ -14,7 +14,9 @@ This tutorial will use GraphStorm's built-in OGB-arxiv dataset for a node classi
 
 .. note::
 
-    All commands below should be run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup` to prepare your environment.
+    All commands below are designed to run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup` to prepare the Docker container environment.
+
+    If you :ref:`set up the GraphStorm environment with pip Packages`, please replace all occurrences of "2222" in the argument ``--ssh-port`` with **22**, and clone the GraphStorm toolkits.
 
 Download and Partition OGB-arxiv Data
 --------------------------------------
@@ -40,13 +42,13 @@ This command will automatically download ogbn-arxiv graph data and split the gra
     graph.dgl
     node_feat.dgl
 
-The ``ogbn-arxiv.json`` file contains meta data about the built distributed DGL graph. Because the command specifies to create one partition with the argument ``--num_parts 1``, there is one sub-folder, named ``part0``. Files in the sub-folder includes three types of data, i.e., the graph structure (``graph.dgl``), the node features (``node_feat.dgl``), and edge features (``edge_feat.dgl``). The ``node_mapping.pt`` and ``edge_mapping.pt`` contain the ID mapping between the raw node and edge IDs with the built graph's node and edge IDs.
+The ``ogbn-arxiv.json`` file contains metadata about the built distributed DGL graph. Because the command specifies to create one partition with the argument ``--num-parts 1``, there is one sub-folder, named ``part0``. Files in the sub-folder include three types of data, i.e., the graph structure (``graph.dgl``), the node features (``node_feat.dgl``), and the edge features (``edge_feat.dgl``). The ``node_mapping.pt`` and ``edge_mapping.pt`` files contain the ID mappings between the raw node and edge IDs and the built graph's node and edge IDs.
 
 .. _launch-training:
 
 Launch Training
 -----------------
-GraphStorm currently relies on **ssh** to launch its scripts. Therefore before launch any scripts, users need to create an IP address file, which contains all private IP addresses in a cluster. If run GraphStorm in the Standalone mode, which run only in a **signle machine**, as this tutorial does, users only need to run the following command to create an ``ip_list.txt`` file that has one row '**127.0.0.1**' as its content.
+GraphStorm currently relies on **ssh** to launch its scripts. Therefore, before launching any scripts, users need to create an IP address file that contains all private IP addresses in the cluster. When running GraphStorm in the Standalone mode, i.e., on a **single machine**, as this tutorial does, users only need to run the following command to create an ``ip_list.txt`` file whose single row is '**127.0.0.1**'.
 
 .. code-block:: bash
 
@@ -112,7 +114,7 @@ Next users can check the :ref:`Use Your Own Graph Data` tutorial t
 
 Clean Up
 ----------
-Once finish GML tasks, users can exit the GraphStorm Docker container with command ``exit`` and then stop the container to restore computation resources.
+Once finished with GML tasks, users can exit the GraphStorm Docker container with the command ``exit`` and then stop the container to free its computation resources.
 
 Run this command in the **container running environment** to leave the GraphStorm container.
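For example, the clean-up sequence looks like the following sketch. The container name ``<container-name>`` is a placeholder, since this tutorial does not fix a specific name; substitute whatever name your GraphStorm container was started with.

.. code-block:: bash

    # Inside the container: leave the container's shell.
    exit

    # Back on the host: stop the container to free its resources.
    docker stop <container-name>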