[Feature] Add new docs for v0.1.1 release (#284)

*Issue #, if available:*

*Description of changes:* This PR includes new docs for the v0.1.1 release.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-authored-by: Ubuntu <[email protected]>
1 parent eb01fb7, commit 198c141. 9 changed files with 753 additions and 78 deletions.
.. _language_models:

Use Text as Node Features
=============================

Many real-world graphs have text as node features, e.g., the title and description of a product, or the questions and comments from users. To leverage these text contents, GraphStorm supports language models (LMs), i.e., HuggingFace BERT models, to embed the text and use the embeddings in graph model training and inference.

There are two modes of using LMs in GraphStorm:

* Embed text contents with pre-trained LMs and use the embeddings as input node features, without fine-tuning the LMs. Training in this mode is fast and memory consumption is lower. However, in some cases the pre-trained LMs may not fit the graph data well and fail to improve performance.

* Co-train both the LMs and the GML models in the same training loop. This fine-tunes the LMs to fit the graph data. In many cases this mode improves performance, but co-training the LMs consumes much more memory, particularly GPU memory, and takes much longer to complete the training loops (see the conceptual sketch after this list).
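
At a high level, the difference between the two modes is whether the LM parameters receive gradients during training. The following is a minimal, purely illustrative PyTorch sketch; it assumes the HuggingFace ``transformers`` package and is not GraphStorm's internal code.

.. code-block:: python

    from transformers import AutoModel

    bert = AutoModel.from_pretrained("bert-base-uncased")

    # Mode 1: freeze the LM and use it only to produce text embeddings;
    # no LM parameters are updated during GNN training.
    for param in bert.parameters():
        param.requires_grad = False

    # Mode 2: leave the LM parameters trainable so they are fine-tuned
    # together with the GNN in the same training loop (more GPU memory,
    # longer training time).
    # for param in bert.parameters():
    #     param.requires_grad = True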

To use LMs in GraphStorm, users can follow the same procedure as the :ref:`Use Your Own Data<use-own-data>` tutorial, with a few minor changes:

* Step 1. Prepare raw data to include text as node data;
* Step 2. Use the GraphStorm graph construction tool to tokenize the text and set the tokens as node features;
* Step 3. Configure GraphStorm to use LMs to embed the tokenized text as input node features; and
* Step 4. If needed, configure GraphStorm to co-train the LM and GNN models.

.. Note::

    All commands below should be run in a GraphStorm Docker container. Please refer to the :ref:`GraphStorm Docker environment setup<setup>` to prepare your environment.

    If you :ref:`set up the GraphStorm environment with pip Packages<setup_pip>`, please replace all occurrences of "2222" in the ``--ssh-port`` argument with **22**, and clone the GraphStorm toolkits.

Prepare Raw Data
------------------
This tutorial uses the same ACM data as the :ref:`Use Your Own Data<use-own-data>` tutorial to demonstrate how to prepare text as node features.

First go to the ``/graphstorm/examples/`` folder.

.. code-block:: bash

    cd /graphstorm/examples

Then run the command to create the ACM data in the required ``raw_w_text`` format.

.. code-block:: bash

    python3 /graphstorm/examples/acm_data.py --output-path /tmp/acm_raw --output-type raw_w_text

Once successful, the command creates a set of folders and files under the ``/tmp/acm_raw/`` folder, similar to the :ref:`outputs<acm-raw-data-output>` in the :ref:`Use Your Own Data<use-own-data>` tutorial. However, the ``config.json`` file has a few extra lines that list the text feature columns and specify how they should be processed during graph construction.

The following snippet shows the information of ``author`` nodes. It indicates that the "**text**" column contains text features, and it requires GraphStorm's graph construction tool to use a `HuggingFace BERT model <https://huggingface.co/models>`_ named ``bert-base-uncased`` to tokenize these text features during construction.

.. code-block:: json

    "nodes": [
        {
            "node_type": "author",
            "format": {
                "name": "parquet"
            },
            "files": [
                "/tmp/acm_raw/nodes/author.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ]
        }

Construct Graph
------------------
Next, we use the graph construction tool to process the ACM raw data with the following command for GraphStorm model training.

.. code-block:: bash

    python3 -m graphstorm.gconstruct.construct_graph \
            --conf-file /tmp/acm_raw/config.json \
            --output-dir /tmp/acm_nc \
            --num-parts 1 \
            --graph-name acm

Outcomes of this command are the same as the :ref:`Outputs of Graph Construction<output-graph-construction>`, except that the ``paper``, ``author``, and ``subject`` nodes all have three additional features, named ``input_ids``, ``attention_mask``, and ``token_type_ids``, which are generated by the BERT tokenizer.
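
To get a sense of what these three features contain, here is a minimal sketch that tokenizes a single piece of text with the HuggingFace ``transformers`` package directly; the sample text is made up, and GraphStorm's construction tool performs the equivalent tokenization for you during graph construction.

.. code-block:: python

    from transformers import AutoTokenizer

    # The same model name as in the "tokenize_hf" transform of config.json.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Tokenize one node's text with the max_seq_length configured above.
    encoded = tokenizer(
        "A sample paper title about graph neural networks",
        max_length=16,
        truncation=True,
        padding="max_length",
    )

    print(encoded["input_ids"])       # token ids, padded/truncated to length 16
    print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
    print(encoded["token_type_ids"])  # segment ids, all 0 for a single sentence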

GraphStorm Language Model Configuration
-----------------------------------------
Users can set up the language model in GraphStorm's configuration YAML file. Below is an example of such a configuration for the ACM data. The full configuration YAML file, `acm_lm_nc.yaml <https://github.com/awslabs/graphstorm/blob/main/examples/use_your_own_data/acm_lm_nc.yaml>`_, is located under GraphStorm's ``examples/use_your_own_data`` folder.

.. code-block:: yaml

    lm_model:
      node_lm_models:
        -
          lm_type: bert
          model_name: "bert-base-uncased"
          gradient_checkpoint: true
          node_types:
            - paper
            - author
            - subject

The current version of GraphStorm supports pre-trained BERT models from the HuggingFace repository on nodes only. Users can choose any `HuggingFace BERT model <https://huggingface.co/models>`_, but the value of ``model_name`` **MUST** be the same as the one specified in the raw data JSON file's ``bert_model`` field. In this example, it is the ``bert-base-uncased`` model.

The ``node_types`` field lists the types of nodes that have tokenized text features. In this ACM example, all three node types have tokenized text features, and all of them are listed in the configuration YAML file.

Launch GraphStorm Training without Fine-tuning BERT Models
------------------------------------------------------------
With the above GraphStorm configuration YAML file, we can launch GraphStorm model training with the same commands as in :ref:`Step 3: Launch training script on your own graphs<launch_training_oyog>`.

First, we create the ``ip_list.txt`` file for the standalone mode.

.. code-block:: bash

    touch /tmp/ip_list.txt
    echo 127.0.0.1 > /tmp/ip_list.txt

Then, the launch command is almost the same, except that in this case the configuration file is ``acm_lm_nc.yaml``, which contains the language model configurations.

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat

During training, GraphStorm first uses the specified BERT model to compute the text embeddings for the specified node types. The text embeddings and the other node features are then concatenated together as the input node features for GNN model training.
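
Conceptually, this resembles the following PyTorch sketch. It is purely illustrative: the texts and the ``numeric_feat`` tensor are made up, and this is not GraphStorm's internal code.

.. code-block:: python

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    # A made-up mini-batch of two nodes: their texts and other numeric features.
    texts = ["paper title one", "paper title two"]
    numeric_feat = torch.randn(2, 256)

    tokens = tokenizer(texts, padding=True, truncation=True,
                       max_length=16, return_tensors="pt")
    with torch.no_grad():  # the LM is not fine-tuned in this mode
        out = bert(**tokens)
    text_emb = out.last_hidden_state[:, 0]  # [CLS] embeddings, shape (2, 768)

    # Concatenate text embeddings with the other node features
    # to form the GNN input features.
    gnn_input = torch.cat([text_emb, numeric_feat], dim=-1)  # shape (2, 1024)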

Launch GraphStorm Training for both BERT and GNN Models
---------------------------------------------------------
To co-train BERT and GNN models, we need to add one more argument, ``--lm-train-nodes``, to either the launch command or the configuration YAML file. The command below adds this argument to the launch command.

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat \
            --lm-train-nodes 10

The ``--lm-train-nodes`` argument determines how many nodes in each mini-batch per GPU are used to fine-tune the BERT models. Because BERT models are normally large, training them consumes a lot of memory. Using all nodes to co-train the BERT and GNN models could cause GPU out-of-memory (OOM) errors. Using a smaller value for ``--lm-train-nodes`` reduces the overall GPU memory consumption.

.. note:: Co-training the BERT and GNN models takes longer than training without fine-tuning the BERT models.

Only Use BERT Models
------------------------
GraphStorm also allows users to perform graph tasks with BERT models only. The ``--lm-encoder-only`` argument controls whether only BERT models are used.

If users want to fine-tune the BERT models only, just add the ``--lm-train-nodes`` argument as in the command below:

.. code-block:: bash

    python3 -m graphstorm.run.gs_node_classification \
            --workspace /tmp \
            --part-config /tmp/acm_nc/acm.json \
            --ip-config /tmp/ip_list.txt \
            --num-trainers 4 \
            --num-servers 1 \
            --num-samplers 0 \
            --ssh-port 2222 \
            --cf /tmp/acm_lm_nc.yaml \
            --save-model-path /tmp/acm_nc/models \
            --node-feat-name paper:feat author:feat subject:feat \
            --lm-encoder-only \
            --lm-train-nodes 10

.. note:: The current version of GraphStorm requires that **ALL** node types have text features when users want to perform the above graph-aware LM fine-tuning only.