[Feature] Add new docs for v0.1.1 release (#284)
*Issue #, if available:*

*Description of changes:*
This PR includes new docs for the v0.1.1 release.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Ubuntu <[email protected]>
2 people authored and Ubuntu committed Jun 23, 2023
1 parent eb01fb7 commit 198c141
Showing 9 changed files with 753 additions and 78 deletions.
184 changes: 184 additions & 0 deletions docs/source/advanced/language-models.rst
@@ -0,0 +1,184 @@
.. _language_models:

Use Text as Node Features
=============================
Many real-world graphs have text contents as node features, e.g., the title and description of a product, or the questions and comments from users. To leverage these text contents, GraphStorm supports language models (LMs), i.e., HuggingFace BERT models, to embed text contents and use the embeddings in graph model training and inference.

There are two modes of using LMs in GraphStorm:

* Embed text contents with pre-trained LMs and use the embeddings as input node features, without fine-tuning the LMs. Training in this mode is fast and memory consumption is lower. However, in some cases pre-trained LMs may not fit the graph data well and fail to improve performance.

* Co-train both LMs and GNN models in the same training loop. This fine-tunes the LMs to fit the graph data. In many cases this mode improves performance, but co-training the LMs consumes much more memory, particularly GPU memory, and takes much longer to complete the training loops.

To use LMs in GraphStorm, users can follow the same procedure as the :ref:`Use Your Own Data<use-own-data>` tutorial with some minor changes.

* Step 1. Prepare raw data to include texts as node data;
* Step 2. Use GraphStorm graph construction tools to tokenize texts and set tokens as node features;
* Step 3. Configure GraphStorm to use LMs to embed tokenized texts as input node features; and
* Step 4. If needed, configure GraphStorm to co-train LM and GNN models.

.. Note::

All commands below should be run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup<setup>` to prepare your environment.

If you :ref:`set up the GraphStorm environment with pip Packages<setup_pip>`, please replace all occurrences of "2222" in the argument ``--ssh-port`` with **22**, and clone GraphStorm toolkits.

Prepare Raw Data
------------------
This tutorial will use the same ACM data as the :ref:`Use Your Own Data<use-own-data>` tutorial to demonstrate how to prepare text as node features.

First, go to the ``/graphstorm/examples/`` folder.

.. code-block:: bash

    cd /graphstorm/examples

Then run the command to create the ACM data with the required ``raw_w_text`` format.

.. code-block:: bash

    python3 /graphstorm/examples/acm_data.py --output-path /tmp/acm_raw --output-type raw_w_text

Once successful, the command will create a set of folders and files under the ``/tmp/acm_raw/`` folder, similar to the :ref:`outputs<acm-raw-data-output>` in the :ref:`Use Your Own Data<use-own-data>` tutorial. But the contents of the ``config.json`` file have a few extra lines that list the text feature columns and specify how they should be processed during graph construction.

The following snippet shows the information of ``author`` nodes. It indicates that the "**text**" column contains text features, and it instructs GraphStorm's graph construction tool to use a `HuggingFace BERT model <https://huggingface.co/models>`_ named ``bert-base-uncased`` to tokenize these text features during construction.

.. code-block:: json

    "nodes": [
        {
            "node_type": "author",
            "format": {
                "name": "parquet"
            },
            "files": [
                "/tmp/acm_raw/nodes/author.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ]
        }

Construct Graph
------------------
Then we use the graph construction tool to process the ACM raw data for GraphStorm model training with the following command.

.. code-block:: bash

    python3 -m graphstorm.gconstruct.construct_graph \
              --conf-file /tmp/acm_raw/config.json \
              --output-dir /tmp/acm_nc \
              --num-parts 1 \
              --graph-name acm

The outcomes of this command are also the same as the :ref:`Outputs of Graph Construction<output-graph-construction>`. But users may notice that the ``paper``, ``author``, and ``subject`` nodes all have three additional features, named ``input_ids``, ``attention_mask``, and ``token_type_ids``, which are generated by the BERT tokenizer.
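
For intuition, the short sketch below shows how a HuggingFace BERT tokenizer produces these three arrays for one text value. It is a standalone illustration, not part of the GraphStorm pipeline, and assumes the ``transformers`` package is installed.

.. code-block:: python

    from transformers import AutoTokenizer

    # Load the tokenizer named in config.json's "bert_model" field.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Tokenize one hypothetical text feature, padded/truncated to the
    # max_seq_length of 16 used in the ACM example above.
    tokens = tokenizer("A tutorial on graph neural networks",
                       max_length=16, padding="max_length", truncation=True)

    print(tokens["input_ids"])       # token IDs, length 16
    print(tokens["attention_mask"])  # 1 for real tokens, 0 for padding
    print(tokens["token_type_ids"])  # segment IDs, all 0 for a single text
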
GraphStorm Language Model Configuration
-----------------------------------------
Users can set up the language model in GraphStorm's configuration YAML file. Below is an example of such a configuration for the ACM data. The full configuration YAML file, `acm_lm_nc.yaml <https://github.com/awslabs/graphstorm/blob/main/examples/use_your_own_data/acm_lm_nc.yaml>`_, is located under GraphStorm's ``examples/use_your_own_data`` folder.

.. code-block:: yaml

    lm_model:
      node_lm_models:
        -
          lm_type: bert
          model_name: "bert-base-uncased"
          gradient_checkpoint: true
          node_types:
            - paper
            - author
            - subject

The current version of GraphStorm supports pre-trained BERT models from the HuggingFace repository on nodes only. Users can choose any `HuggingFace BERT model <https://huggingface.co/models>`_, but the value of ``model_name`` **MUST** be the same as the one specified in the raw data JSON file's ``bert_model`` field. In this example, it is the ``bert-base-uncased`` model.

The ``node_types`` field lists the types of nodes that have tokenized text features. In this ACM example, all three types of nodes have tokenized text features, all of which are listed in the configuration YAML file.

Launch GraphStorm Training without Fine-tuning BERT Models
------------------------------------------------------------
With the above GraphStorm configuration YAML file, we can launch GraphStorm model training with the same commands as in :ref:`Step 3: Launch training script on your own graphs<launch_training_oyog>`.

First, we create the ``ip_list.txt`` file for the standalone mode.

.. code-block:: bash

    touch /tmp/ip_list.txt
    echo 127.0.0.1 > /tmp/ip_list.txt

Then, the launch command is almost the same, except that in this case the configuration file is ``acm_lm_nc.yaml``, which contains the language model configurations.

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat

In the training process, GraphStorm will first use the specified BERT model to compute the text embeddings for the specified node types. Then the text embeddings and other node features are concatenated together as the input node features for GNN model training.
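
Conceptually, this mode works like the minimal PyTorch sketch below. It is not GraphStorm's internal code: the tensors are hypothetical, and pooling with the ``[CLS]`` vector is an assumption.

.. code-block:: python

    import torch
    from transformers import AutoModel

    bert = AutoModel.from_pretrained("bert-base-uncased")
    bert.eval()  # the LM stays frozen in this mode

    @torch.no_grad()
    def embed_text(input_ids, attention_mask):
        out = bert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # one [CLS] vector per node

    # Hypothetical batch of 4 nodes: tokenized text plus a 256-dim "feat".
    input_ids = torch.randint(0, 30522, (4, 16))
    attention_mask = torch.ones(4, 16, dtype=torch.long)
    feat = torch.randn(4, 256)

    text_emb = embed_text(input_ids, attention_mask)  # shape (4, 768)
    gnn_input = torch.cat([text_emb, feat], dim=1)    # shape (4, 1024)
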
Launch GraphStorm Training for both BERT and GNN Models
---------------------------------------------------------
To co-train BERT and GNN models, we need to add one more argument, ``--lm-train-nodes``, to either the launch command or the configuration YAML file. The command below adds this argument to the launch command.

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat \
            --lm-train-nodes 10

The ``--lm-train-nodes`` argument determines how many nodes will be used in each mini-batch per GPU to tune the BERT models. Because BERT models are normally large, training them consumes a lot of memory. Using all nodes to co-train the BERT and GNN models could cause GPU out-of-memory (OOM) errors. Using a smaller number for ``--lm-train-nodes`` reduces the overall GPU memory consumption.

.. note:: Co-training BERT and GNN models takes longer than training without co-training.
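
One plausible way such a cap can work is sketched below: in each mini-batch, only a sampled subset of nodes runs through the LM with gradients enabled, while the remaining nodes are embedded without gradients. This is an illustrative assumption, not GraphStorm's actual implementation.

.. code-block:: python

    import torch

    def lm_embeddings(bert, tokens, lm_train_nodes):
        """Illustrative only: cap how many nodes fine-tune the LM per batch."""
        n = tokens["input_ids"].shape[0]
        train_idx = torch.randperm(n)[:lm_train_nodes]

        # Most nodes: forward pass only, no gradients kept.
        with torch.no_grad():
            emb = bert(**tokens).last_hidden_state[:, 0].clone()

        # Re-run the sampled subset with gradients enabled, so the loss
        # updates the BERT weights through these nodes only.
        sub = {k: v[train_idx] for k, v in tokens.items()}
        emb[train_idx] = bert(**sub).last_hidden_state[:, 0]
        return emb
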
Only Use BERT Models
------------------------
GraphStorm also allows users to use only BERT models to perform graph tasks. We can add another argument, ``--lm-encoder-only``, to control whether or not only BERT models are used.

If users want to fine-tune the BERT model only, just add the ``--lm-train-nodes`` argument as in the command below:

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat \
            --lm-encoder-only \
            --lm-train-nodes 10

.. note:: The current version of GraphStorm requires that **ALL** node types have text features when users want to perform the graph-aware LM fine-tuning only.
37 changes: 19 additions & 18 deletions docs/source/configuration/configuration-gconstruction.rst
@@ -6,19 +6,19 @@ Graph Construction
`construct_graph.py <https://github.com/zhjwy9343/graphstorm/blob/main/python/graphstorm/gconstruct/construct_graph.py>`_ arguments
--------------------------------------------------------------------------------------------------------------------------------------

* **-\-conf-file**: (**Required**) the path of the configuration JSON file.
* **-\-num-processes**: the number of processes to process the data simultaneously. Default is 1. Increasing this number can speed up data processing.
* **-\-num-processes-for-nodes**: the number of processes to process node data simultaneously. Increasing this number can speed up node data processing.
* **-\-num-processes-for-edges**: the number of processes to process edge data simultaneously. Increasing this number can speed up edge data processing.
* **-\-output-dir**: (**Required**) the path of the output data files.
* **-\-graph-name**: (**Required**) the name assigned to the graph.
* **-\-remap-node_id**: boolean value to decide whether to rename node IDs or not. Default is true.
* **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Default is true.
* **-\-output-format**: the format of the constructed graph; options are ``DGL`` and ``DistDGL``. Default is ``DistDGL``. The output format is explained in the :ref:`Output <output-format>` section below.
* **-\-num-parts**: the number of partitions of the constructed graph. This is only valid if the output format is ``DistDGL``.
* **-\-skip-nonexist-edges**: boolean value to decide whether to skip edges whose endpoint nodes don't exist. Default is true.
* **-\-ext-mem-workspace**: the directory where the tool can store data during graph construction. We suggest using a high-speed SSD as the external memory workspace.
* **-\-ext-mem-feat-size**: the minimal number of dimensions for a feature to be stored in external memory. Default is 64.
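
For example, a typical invocation combining several of these arguments might look as follows (the paths and graph name are hypothetical, following the tutorial's ACM example):

.. code-block:: bash

    python3 -m graphstorm.gconstruct.construct_graph \
            --conf-file /tmp/acm_raw/config.json \
            --output-dir /tmp/acm_nc \
            --graph-name acm \
            --num-processes 4 \
            --num-parts 2 \
            --output-format DistDGL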

.. _gconstruction-json:

@@ -78,19 +78,20 @@

Feature transformation
.........................
Currently, the graph construction pipeline supports the following feature transformations:

* **HuggingFace tokenizer transformation** tokenizes text strings with a HuggingFace tokenizer. The ``name`` field in the feature transformation dictionary is ``tokenize_hf``. The dict should contain two additional fields. ``bert_model`` specifies the BERT model used for tokenization. Users can choose any `HuggingFace BERT model <https://huggingface.co/models>`_. ``max_seq_length`` specifies the maximal sequence length.
* **HuggingFace BERT transformation** encodes text strings with a HuggingFace BERT model. The ``name`` field in the feature transformation dictionary is ``bert_hf``. The dict should contain two additional fields. ``bert_model`` specifies the BERT model used for embedding text. Users can choose any `HuggingFace BERT model <https://huggingface.co/models>`_. ``max_seq_length`` specifies the maximal sequence length.
* **Numerical MAX_MIN transformation** normalizes numerical input features with `val = (val-min)/(max-min)`, where `val` is the feature value, `max` is the maximum number in the feature and `min` is the minimum number in the feature. The ``name`` field in the feature transformation dictionary is ``max_min_norm``. The dict can contain two optional fields. ``max_bound`` specifies the maximum value allowed in the feature; any number larger than ``max_bound`` will be set to ``max_bound``. ``min_bound`` specifies the minimum value allowed in the feature; any number smaller than ``min_bound`` will be set to ``min_bound``. A minimal sketch of this transformation appears after this list.
* **Numerical Rank Gauss transformation** normalizes numerical input features with rank gauss normalization. It maps the numeric feature values to a Gaussian distribution based on ranking, following https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927. The ``name`` field in the feature transformation dictionary is ``rank_gauss``. The dict can contain one optional field, ``epsilon``, which is used to avoid INF floats during computation.
* **Convert to categorical values** converts text data to categorical values. The ``name`` field is ``to_categorical``. ``separator`` specifies how to split the string into multiple categorical values (this is only used to define multiple categorical values). If ``separator`` is not specified, the entire string is a categorical value. ``mapping`` is a dict that specifies how to map a string to an integer value that defines a categorical value.
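
As a concrete illustration, below is a minimal NumPy sketch of the MAX_MIN transformation described above; it is not the pipeline's exact implementation.

.. code-block:: python

    import numpy as np

    def max_min_norm(val, max_bound=np.inf, min_bound=-np.inf):
        # Clip to the allowed bounds first, as described above.
        val = np.clip(np.asarray(val, dtype=np.float64), min_bound, max_bound)
        fmin, fmax = val.min(), val.max()
        # Rescale into [0, 1]; assumes the feature is not constant.
        return (val - fmin) / (fmax - fmin)

    print(max_min_norm([0.0, 5.0, 10.0]))                  # [0.  0.5 1. ]
    print(max_min_norm([0.0, 5.0, 50.0], max_bound=10.0))  # 50 clipped to 10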

.. _output-format:

Output
..........
Currently, the graph construction pipeline outputs two formats: ``DistDGL`` and ``DGL``. If ``DGL`` is selected, the output is a single file, named `<graph_name>.dgl`, under the folder specified by the **-\-output-dir** argument, where `<graph_name>` is the value of the **-\-graph-name** argument. If ``DistDGL`` is selected, the output is a JSON file, named `<graph_name>.json`, and a set of `part*` folders under the folder specified by the **-\-output-dir** argument, where `*` ranges from 0 to the value of the **-\-num-parts** argument minus 1.

By specifying the ``output_format`` as ``DGL``, the output will be a `DGLGraph <https://docs.dgl.ai/en/1.0.x/generated/dgl.save_graphs.html>`_. By specifying the ``output_format`` as ``DistDGL``, the output will be a partitioned `DistDGL graph <https://doc.dgl.ai/guide/distributed-preprocessing.html#partitioning-api>`_. It contains the partitioned graph, a JSON config describing the meta-information of the partitioned graph, and the mappings for the edges and nodes after partitioning, ``node_mapping.pt`` and ``edge_mapping.pt``, which map each node and edge in the partitioned graph back to the original node and edge ID space. The node ID mapping is stored as a dictionary of 1D tensors whose key is the node type and whose value is a 1D tensor mapping between shuffled node IDs and the original node IDs. The edge ID mapping is stored as a dictionary of 1D tensors whose key is the edge type and whose value is a 1D tensor mapping between shuffled edge IDs and the original edge IDs.
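
Because everything GraphStorm outputs lives in the shuffled ID space, the mapping files are what let users translate results back. Below is a hedged sketch of loading the node mapping; it assumes, as is conventional, that position ``i`` of each tensor holds the original ID of shuffled ID ``i``.

.. code-block:: python

    import torch

    # Load the node ID mapping written during graph construction.
    node_mapping = torch.load("node_mapping.pt")   # {node_type: 1D tensor}
    paper_map = node_mapping["paper"]

    shuffled_ids = torch.tensor([0, 1, 2])   # IDs in the partitioned graph
    original_ids = paper_map[shuffled_ids]   # IDs in the raw input data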

.. note:: The two mapping files record the mapping between the original node and edge IDs in the raw data files and the IDs of nodes and edges in the constructed graph. They are important for mapping the training and inference outputs. Therefore, DO NOT move or delete them.