Commit

[sync] 1011 sync code (#6)
* add lagcl
* add english
* fix setup.py
* modify spark conf
* optimize merit, pagnn training speed
* format
zdlant authored Oct 17, 2023
1 parent f62b619 commit 1bb2a13
Showing 79 changed files with 3,837 additions and 444 deletions.
121 changes: 69 additions & 52 deletions README.md
@@ -2,7 +2,9 @@

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](./LICENSE)

Ant Graph Learning (AGL) provides an end-to-end solution for industrial-scale graph learning tasks.
[Chinese documentation](./README_CN.md)

Ant Graph Learning (AGL) provides a comprehensive solution for graph learning tasks at an industrial scale.

[//]: # (<div align="center">)

@@ -14,57 +16,72 @@ Ant Graph Learning (AGL) provides an end-to-end solution for industrial-scale graph learning tasks.

[//]: # (</div>)

![](./doc/core/architecture.png)

Industrial-scale graph learning tasks have the following characteristics:

* Complex graph data:
    * Large scale: typically billions of nodes, tens of billions of edges, and hundreds of millions of samples.
    * Data dependencies: a node's embedding depends on the embeddings of its neighboring nodes/edges.
    * Rich types: homogeneous/heterogeneous/dynamic graphs.
* Complex task types:
    * Offline: offline training, offline batch prediction, offline full-graph prediction.
    * Online: online training, online prediction (which must be consistent with offline results).
* Complex usage patterns and scenarios:
    * Multi-tenancy.
    * Varied usage: GNN-only, GNN + search/recommendation/advertising or multi-modal models.
    * Heterogeneous resources: CPU/GPU clusters.

AGL's approach to these challenges:

* Graph scale:
    * Graph training: convert the large graph into small subgraphs before training, removing data dependencies.
* Scalability:
    * Graph sampling: conditional filtering (indexes) + sampling (random/probabilistic, TopK).
    * Graph representation: a graph-feature format that expresses homogeneous/heterogeneous/dynamic graphs; supports node/edge/graph-level subgraphs; and can optionally store only the graph structure.
    * Graph training: with data dependencies removed, mature DNN training architectures (e.g., PS, AllReduce) can be reused for large-scale distributed training.
* Stability:
    * Reuse mature Spark or MapReduce (for the graph-sample phase) and the elasticity and fault tolerance of DNN pipeline infrastructure.
* Consistency:
    * Sample consistency: graph samples are generated offline and can be reused for online/offline prediction.
* Resource cost:
    * Graph features can be stored on disk, reducing memory requirements.

Based on these considerations, AGL designed graph-data construction and learning solutions that can complete large-scale graph learning tasks on ordinary clusters:

- Graph samples: AGL uses Spark (or MR) to pre-extract the k-hop neighborhood of each target node as its GraphFeature.

- Graph training: the training phase provides parsing logic that converts GraphFeature into the adjacency matrix, node feature matrix, edge feature matrix, and other inputs the model needs. Seamlessly plugging graph learning into the ordinary DNN training mode in this way makes it easy to reuse that mode's mature technologies and infrastructure.

AGL currently uses PyTorch as its backend and integrates the open-source algorithm library PyG to reduce development effort. For complex graph data (homogeneous/heterogeneous/dynamic graphs), AGL also provides a rich set of in-house graph algorithms (node classification, link prediction, representation learning, etc.).

# How to use

* [Installation Guide](doc/core/install.md)
* [Process Workflow](doc/core/process_description.md)
* [Generate Graph Samples](doc/core/sampler/0_data_preparation.md)
* [Graph Learning Tutorial](doc/core/graph_learning_tutorial.md)

# How to Contribute

* [Contribution Guidelines](doc/core/contribution.md)
![](doc/core/English/images/architecture_EN.png)

Graph learning tasks in industrial settings exhibit the following characteristics:

* Complex graph data:
    * Large-scale graphs: typically billions of nodes, tens of billions of edges, and hundreds of millions of samples.
    * Data dependencies: the computation of a node's embedding relies on the embeddings of its neighboring nodes/edges.
    * Diverse types: homogeneous/heterogeneous/dynamic graphs.
* Complex task types:
    * Offline: offline training, offline batch prediction, offline full-graph prediction.
    * Online: online training, online prediction (which must be consistent with offline results).
* Complex usage patterns and scenarios:
    * Multi-tenancy.
    * Varied usage: GNN-only, GNN + search/recommendation/advertising or multi-modal models.
    * Heterogeneous resources: CPU/GPU clusters.

AGL addresses these challenges with the following approaches:

* Graph scale:
    * AGL removes the data-dependency problem by transforming the large graph into small subgraphs in advance.
* Scalability:
    * Graph sampling: conditional filtering (indexes) + sampling (random/probabilistic, TopK).
    * Graph representation: a graph-feature format that can express homogeneous, heterogeneous, and dynamic graphs; supports node-, edge-, and graph-level subgraphs; and can optionally store only the graph structure (see the schema example after this list).
    * Graph training: with the data dependencies removed, mature DNN training architectures such as PS (Parameter Server) and AllReduce can be reused for efficient, large-scale distributed training.
* Stability:
    * Reuse mature Spark or MapReduce (for the graph-sampling phase) and the elasticity and fault tolerance of existing DNN infrastructure.
* Consistency:
    * Sample consistency: graph samples generated offline can be reused for online/offline prediction.
* Resource cost:
    * Graph features can be stored on disk, reducing memory requirements.
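For a concrete sense of what a graph-feature schema looks like, the `--subgraph_spec` passed to the sampling job in the kcan_movielens example later in this commit declares one default node type and one default edge type, each carrying a 1-dimensional dense int64 feature. The sketch below renders that same spec as a Python dict purely for readability; the content is copied from the submit script, only the layout is new:

```python
# subgraph_spec from agl/python/examples/kcan_movielens (this commit),
# reformatted from its single-line JSON form for readability.
subgraph_spec = {
    "node_spec": [
        {
            "node_name": "default",
            "id_type": "string",
            "features": [
                {"name": "node_feature", "type": "dense", "dim": 1, "value": "int64"}
            ],
        }
    ],
    "edge_spec": [
        {
            "edge_name": "default",
            "n1_name": "default",
            "n2_name": "default",
            "id_type": "string",
            "features": [
                {"name": "edge_feature", "type": "dense", "dim": 1, "value": "int64"}
            ],
        }
    ],
}
```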

Based on these considerations, AGL has developed comprehensive solutions for graph data construction and learning,
enabling the completion of large-scale graph learning tasks on regular machines or clusters:

* Graph sampling:
    * AGL uses Spark (or MapReduce) to pre-extract the k-hop neighborhood of each target node as its graph feature.
* Graph training:
    * During training, AGL provides parsing logic that converts graph features into the inputs the model needs, such as the adjacency matrix, node feature matrix, and edge feature matrix. Plugging graph learning into the ordinary DNN training mode in this way makes it easy to reuse the mature technologies and infrastructure of standard DNN workflows; a minimal sketch of this pattern follows.
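Concretely, in plain PyTorch the pattern looks like this. The sketch deliberately avoids the pyagl API: the stand-in class and function below play the roles that AGLTorchMapBasedDataset and AGLHomoCollateForPyG play elsewhere in this commit, and their names, arguments, and tensor shapes are illustrative assumptions, not AGL's actual signatures:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class GraphFeatureDataset(Dataset):
    """Stand-in for AGL's map-based dataset: each item is one serialized
    GraphFeature (e.g., a pb string) together with its label."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]


def collate_graph_features(batch):
    """Stand-in for the collate step: decode each GraphFeature into the
    tensors a GNN consumes. Real parsing lives in pyagl's C++ decoder;
    the placeholder tensors here only illustrate the outputs."""
    labels = torch.tensor([label for _, label in batch])
    edge_index = torch.zeros((2, 0), dtype=torch.long)  # adjacency as COO pairs
    node_feats = torch.zeros((0, 8))                    # node feature matrix
    return edge_index, node_feats, labels


records = [(b"graph_feature_pb", 1), (b"graph_feature_pb", 0)]
loader = DataLoader(
    GraphFeatureDataset(records), batch_size=2, collate_fn=collate_graph_features
)
for edge_index, node_feats, labels in loader:
    pass  # from here on it is an ordinary PyTorch/PyG training loop
```

The design point is that everything graph-specific is confined to the collate step; the surrounding loop, and therefore the distributed training and serving infrastructure, stays standard.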

AGL currently uses PyTorch as its backend and integrates open-source algorithm libraries such as PyG to reduce development effort. For complex graph data in homogeneous, heterogeneous, and dynamic forms, AGL also provides in-house graph algorithms for node classification, edge prediction, and representation learning.

# How to use

* [Installation Guide](doc/core/English/install_EN.md)
* [Process Workflow](doc/core/English/process_description_EN.md)
* [Generate Graph Samples](doc/core/English/sampler/0_data_preparation_EN.md)
* [Graph Learning Tutorial](doc/core/English/graph_learning_tutorial_EN.md)

# How to Contribute

* [Contribution Guidelines](doc/core/English/contribution_EN.md)

# Cite

92 changes: 92 additions & 0 deletions README_CN.md
@@ -0,0 +1,92 @@
# Ant Graph Learning

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](./LICENSE)

Ant Graph Learning (AGL) provides an end-to-end solution for industrial-scale graph learning tasks.

[//]: # (<div align="center">)

[//]: # (<img src=./doc/core/architecture.png>)

[//]: # (<br>)

[//]: # (<b>Figure</b>: AGL Overall Architecture)

[//]: # (</div>)

![](doc/core/Chinese/images/architecture.png)

Industrial-scale graph learning tasks have the following characteristics:

* Complex graph data:
    * Large scale: typically billions of nodes, tens of billions of edges, and hundreds of millions of samples.
    * Data dependencies: a node's embedding depends on the embeddings of its neighboring nodes/edges.
    * Rich types: homogeneous/heterogeneous/dynamic graphs.
* Complex task types:
    * Offline: offline training, offline batch prediction, offline full-graph prediction.
    * Online: online training, online prediction (which must be consistent with offline results).
* Complex usage patterns and scenarios:
    * Multi-tenancy.
    * Varied usage: GNN-only, GNN + search/recommendation/advertising or multi-modal models.
    * Heterogeneous resources: CPU/GPU clusters.

AGL's approach to these challenges:

* Graph scale:
    * Graph training: convert the large graph into small subgraphs before training, removing data dependencies.
* Scalability:
    * Graph sampling: conditional filtering (indexes) + sampling (random/probabilistic, TopK).
    * Graph representation: a graph-feature format that expresses homogeneous/heterogeneous/dynamic graphs; supports node/edge/graph-level subgraphs; and can optionally store only the graph structure.
    * Graph training: with data dependencies removed, mature DNN training architectures (e.g., PS, AllReduce) can be reused for large-scale distributed training.
* Stability:
    * Reuse mature Spark or MapReduce (for the graph-sample phase) and the elasticity and fault tolerance of DNN pipeline infrastructure.
* Consistency:
    * Sample consistency: graph samples are generated offline and can be reused for online/offline prediction.
* Resource cost:
    * Graph features can be stored on disk, reducing memory requirements.

Based on these considerations, AGL designed graph-data construction and learning solutions that can complete large-scale graph learning tasks on ordinary clusters:

- Graph samples: AGL uses Spark (or MR) to pre-extract the k-hop neighborhood of each target node as its GraphFeature.

- Graph training: the training phase provides parsing logic that converts GraphFeature into the adjacency matrix, node feature matrix, edge feature matrix, and other inputs the model needs. Seamlessly plugging graph learning into the ordinary DNN training mode in this way makes it easy to reuse that mode's mature technologies and infrastructure.

AGL currently uses PyTorch as its backend and integrates the open-source algorithm library PyG to reduce development effort. For complex graph data (homogeneous/heterogeneous/dynamic graphs), AGL also provides a rich set of in-house graph algorithms (node classification, link prediction, representation learning, etc.).

# How to use

* [Installation Guide](doc/core/Chinese/install.md)
* [Process Workflow](doc/core/Chinese/process_description.md)
* [Generate Graph Samples](doc/core/Chinese/sampler/0_data_preparation.md)
* [Graph Learning Tutorial](doc/core/Chinese/graph_learning_tutorial.md)

# How to Contribute

* [Contribution Guidelines](doc/core/Chinese/contribution.md)

# Cite

```
@article{zhang13agl,
title={AGL: A Scalable System for Industrial-purpose Graph Machine Learning},
author={Zhang, Dalong and Huang, Xin and Liu, Ziqi and Zhou, Jun and Hu, Zhiyang and Song, Xianzheng and Ge, Zhibang and Wang, Lin and Zhang, Zhiqiang and Qi, Yuan},
journal={Proceedings of the VLDB Endowment},
volume={13},
number={12},
year={2020}
}
@inproceedings{zhang2023inferturbo,
title={InferTurbo: A Scalable System for Boosting Full-graph Inference of Graph Neural Network over Huge Graphs},
author={Zhang, Dalong and Song, Xianzheng and Hu, Zhiyang and Li, Yang and Tao, Miao and Hu, Binbin and Wang, Lin and Zhang, Zhiqiang and Zhou, Jun},
booktitle={2023 IEEE 39th International Conference on Data Engineering (ICDE)},
pages={3235--3247},
year={2023},
organization={IEEE Computer Society}
}
```

# License

[Apache License 2.0](LICENSE)
2 changes: 1 addition & 1 deletion agl/python/data/agl_dtype.py
@@ -16,7 +16,7 @@

import numpy as np

from pyagl.pyagl import AGLDType
from pyagl import AGLDType

DTypeValue = namedtuple("DTypeValue", ["name", "np_dtype", "c_dtype"])

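Most of the remaining file diffs in this commit make the same one-line change: importing the bindings from the package root (`from pyagl import ...`) instead of the nested module (`from pyagl.pyagl import ...`). For downstream code that has to run against both layouts, a small compatibility sketch (assuming only that the two import paths shown in these diffs exist on their respective versions):

```python
# Compatibility sketch: try the post-commit package-root layout first,
# then fall back to the older nested-module layout. The symbols are ones
# imported elsewhere in this commit.
try:
    from pyagl import AGLDType, NodeSpec, EdgeSpec
except ImportError:
    from pyagl.pyagl import AGLDType, NodeSpec, EdgeSpec
```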
2 changes: 1 addition & 1 deletion agl/python/data/collate.py
@@ -13,7 +13,7 @@
import torch
from typing import List

from pyagl.pyagl import (
from pyagl import (
NodeSpec,
EdgeSpec,
)
2 changes: 1 addition & 1 deletion agl/python/data/collate_test.py
@@ -5,7 +5,7 @@
import os
import numpy as np

from pyagl.pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from agl.python.data.subgraph.pyg_inputs import TorchSubGraphBatchData
4 changes: 2 additions & 2 deletions agl/python/data/column.py
@@ -93,7 +93,7 @@ def _c_decode(self, data, **kwargs):
if isinstance(data[0], bytes):
# if it is instance of bytes (encoded by utf-8). call multi_dense_decode_bytes
# (implemented with c++) and pass those data to c++ in a zero copy way
from pyagl.pyagl import multi_dense_decode_bytes
from pyagl import multi_dense_decode_bytes

data_bytesarray = [bytearray(data_t) for data_t in data]
res = multi_dense_decode_bytes(
@@ -107,7 +107,7 @@
res_np_array_list = [np.array(res_i) for res_i in res]
elif isinstance(data[0], str):
# if data is instance of str, passing it from Python to C++ using pybind11 will trigger a copy.
from pyagl.pyagl import multi_dense_decode_string
from pyagl import multi_dense_decode_string

res = multi_dense_decode_string(
data,
2 changes: 1 addition & 1 deletion agl/python/data/multi_graph_feature_collate.py
@@ -12,7 +12,7 @@

from typing import List, Union, Callable, Optional, Any, Dict

from pyagl.pyagl import (
from pyagl import (
NodeSpec,
EdgeSpec,
)
2 changes: 1 addition & 1 deletion agl/python/data/subgraph/subgraph.py
@@ -4,7 +4,7 @@
from typing import List
import numpy as np

from pyagl.pyagl import (
from pyagl import (
AGLDType,
DenseFeatureSpec,
SparseKVSpec,
2 changes: 1 addition & 1 deletion agl/python/data/subgraph/subgraph_test.py
@@ -6,7 +6,7 @@
import numpy as np

from agl.python.data.subgraph.subgraph import PySubGraph
from pyagl.pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec


class SubGraphTest(unittest.TestCase):
2 changes: 1 addition & 1 deletion agl/python/dataset/dataset_collate_test.py
@@ -11,7 +11,7 @@
from agl.python.dataset.map_based_dataset import AGLTorchMapBasedDataset
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from pyagl.pyagl import AGLDType, SparseKVSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, SparseKVSpec, NodeSpec, EdgeSpec


class DatasetAndCollateFnTest(unittest.TestCase):
2 changes: 1 addition & 1 deletion agl/python/examples/drgst/drgst_citeseer.py
@@ -9,7 +9,7 @@
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from agl.python.model.encoder.drgst import DRGSTEncoder
from pyagl.pyagl import (
from pyagl import (
AGLDType,
DenseFeatureSpec,
SparseKVSpec,
2 changes: 1 addition & 1 deletion agl/python/examples/geniepath_ppi/train_geniepath_ppi.py
@@ -9,7 +9,7 @@
from agl.python.dataset.map_based_dataset import AGLTorchMapBasedDataset
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLRowColumn, AGLMultiDenseColumn
from pyagl.pyagl import (
from pyagl import (
AGLDType,
DenseFeatureSpec,
NodeSpec,
2 changes: 1 addition & 1 deletion agl/python/examples/hegnn_acm/model_hegnn.py
@@ -9,7 +9,7 @@
import torch.nn.functional as F

from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from pyagl.pyagl import (
from pyagl import (
AGLDType,
SparseKVSpec,
NodeSpec,
1 change: 0 additions & 1 deletion agl/python/examples/kcan_movielens/data_process/submit.sh
@@ -2,7 +2,6 @@ base=`dirname "$0"`
cd "$base"

python ../../run_spark.py \
--mode yarn \
--jar_resource_path ../../../../java/target/flatv3-1.0-SNAPSHOT.jar \
--input_edge_table_name ./edge_table.csv \
--input_label_table_name ./link_table.csv \
2 changes: 1 addition & 1 deletion agl/python/examples/kcan_movielens/kcan_subgraph_adj.py
@@ -17,7 +17,7 @@
)
from agl.python.data.column import AGLRowColumn, AGLMultiDenseColumn
from agl.python.model.encoder.kcan import KCANEncoder
from pyagl.pyagl import AGLDType, DenseFeatureSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, DenseFeatureSpec, NodeSpec, EdgeSpec


def delete_root_index(subgraph: TorchSubGraphBatchData):
6 changes: 3 additions & 3 deletions agl/python/examples/kcan_movielens/readme.md
@@ -23,21 +23,21 @@

First, compress the raw data into subgraph (pb string) form using the data_process/submit.sh command below.

Because link mode produces a huge number of samples, users need to set up a Spark cluster to run the job. Users who cannot set up a cluster can download the pre-sampled subgraph data part-subgraph_kcan_train_test.csv from the link above and place it in the data_process/output_graph_feature directory.
Because link mode produces a huge number of samples, running the Spark job requires a machine with 100 GB of memory: modify [start_docker_with_image.sh](../../../../docker/start_docker_with_image.sh) to allocate 100 GB to the container, and set spark.executor.memory=90g and spark.driver.memory=90g in [run_spark_template.sh](../run_spark_template.sh).
Users without such resources can download the pre-sampled subgraph data part-subgraph_kcan_train_test.csv from the link above and place it in the data_process/output_graph_feature directory.

```
base=`dirname "$0"`
cd "$base"
python ../../run_spark.py \
--mode yarn \
--jar_resource_path ../../../../java/target/flatv3-1.0-SNAPSHOT.jar \
--input_edge_table_name ./edge_table.csv \
--input_label_table_name ./link_table.csv \
--input_node_table_name ./node_table.csv \
--output_table_name_prefix ./output_graph_feature \
--neighbor_distance 2 \
--sample_condition 'random_sampler(limit=20, seed=34, replacement=false)' \
--sample_condition 'random_sampler(limit=20, seed=34, replacement=false)' \
--subgraph_spec "{'node_spec':[{'node_name':'default','id_type':'string','features':[{'name':'node_feature','type':'dense','dim':1,'value':'int64'}]}],'edge_spec':[{'edge_name':'default','n1_name':'default','n2_name':'default','id_type':'string','features':[{'name':'edge_feature','type':'dense','dim':1,'value':'int64'}]}]}" \
--algorithm kcan
```
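The two memory settings called out above are ordinary Spark properties. As a hypothetical illustration (not how this repo's scripts work), if you were driving Spark from Python yourself rather than through run_spark.py, the equivalent configuration with the standard pyspark API would be:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Mirrors the run_spark_template.sh guidance above (~90g each on a 100 GB machine).
# Note: spark.driver.memory only takes effect if set before the driver JVM starts
# (e.g., via spark-submit --conf), which is why the template script sets it there
# rather than in application code.
conf = (
    SparkConf()
    .set("spark.executor.memory", "90g")
    .set("spark.driver.memory", "90g")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```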