Commit

[sync] 1011 sync code (#6)
* add lagcl
* add english
* fix setup.py
* modify spark conf
* optimize merit, pagnn training speed
* format
zdlant authored Oct 17, 2023
1 parent f62b619 commit 1bb2a13
Showing 79 changed files with 3,837 additions and 444 deletions.
121 changes: 69 additions & 52 deletions README.md
@@ -2,7 +2,9 @@

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](./LICENSE)

Ant Graph Learning (AGL) provides an end-to-end solution for industrial-scale graph learning tasks.
[Chinese documentation](./README_CN.md)

Ant Graph Learning (AGL) provides a comprehensive solution for graph learning tasks at an industrial scale.

[//]: # (<div align="center">)

@@ -14,57 +16,72 @@ Ant Graph Learning (AGL) provides an end-to-end solution for industrial-scale graph learning tasks.

[//]: # (</div>)

![](./doc/core/architecture.png)

Industrial-scale graph learning tasks have the following characteristics:

* Complex graph data:
    * Large scale: typically billions of nodes, tens of billions of edges, and hundreds of millions of samples.
    * Data dependencies: a node's embedding depends on the embeddings of its neighboring nodes/edges.
    * Rich types: homogeneous/heterogeneous/dynamic graphs.
* Complex task types:
    * Offline: offline training, offline batch prediction, offline full-graph prediction.
    * Online: online training, online prediction (which must be consistent with offline results).
* Complex usage patterns and scenarios:
    * Multi-tenancy.
    * Varied usage: GNN-only, GNN + search/recommendation/advertising or multi-modal models.
    * Heterogeneous resources: CPU/GPU clusters.

AGL's approach to these challenges:

* Graph scale:
    * Graph training: convert the large graph into small subgraphs before training, removing data dependencies.
* Scalability:
    * Graph sampling: conditional filtering (indexes) + sampling (random/probabilistic, TopK).
    * Graph representation: a graph-feature format that expresses homogeneous/heterogeneous/dynamic graphs; supports node/edge/graph-level subgraphs; and can optionally store only the graph structure.
    * Graph training: with data dependencies removed, mature DNN training architectures (e.g., PS, AllReduce) can be reused for large-scale distributed training.
* Stability:
    * Reuse mature Spark or MapReduce (for the graph-sample phase) and the elasticity and fault tolerance of DNN pipeline infrastructure.
* Consistency:
    * Sample consistency: graph samples are generated offline and can be reused for online/offline prediction.
* Resource cost:
    * Graph features can be stored on disk, reducing memory requirements.

Based on these considerations, AGL designed graph-data construction and learning solutions that can complete large-scale graph learning tasks on ordinary clusters:

- Graph samples: AGL uses Spark (or MR) to pre-extract the k-hop neighborhood of each target node as its GraphFeature.

- Graph training: the training phase provides parsing logic that converts GraphFeature into the adjacency matrix, node feature matrix, edge feature matrix, and other inputs the model needs. Seamlessly plugging graph learning into the ordinary DNN training mode in this way makes it easy to reuse that mode's mature technologies and infrastructure.

AGL currently uses PyTorch as its backend and integrates the open-source algorithm library PyG to reduce development effort. For complex graph data (homogeneous/heterogeneous/dynamic graphs), AGL also provides a rich set of in-house graph algorithms (node classification, link prediction, representation learning, etc.).

# How to use

* [Installation Guide](doc/core/install.md)
* [Process Workflow](doc/core/process_description.md)
* [Generate Graph Samples](doc/core/sampler/0_data_preparation.md)
* [Graph Learning Tutorial](doc/core/graph_learning_tutorial.md)

# How to Contribute

* [Contribution Guidelines](doc/core/contribution.md)
![](doc/core/English/images/architecture_EN.png)

Graph learning tasks in industrial settings exhibit the following characteristics:

* Complex graph data:
    * Large-scale graphs: typically billions of nodes, tens of billions of edges, and hundreds of millions of samples.
    * Data dependencies: the computation of a node's embedding relies on the embeddings of its neighboring nodes/edges.
    * Diverse types: homogeneous/heterogeneous/dynamic graphs.
* Complex task types:
    * Offline: offline training, offline batch prediction, offline full-graph prediction.
    * Online: online training, online prediction (which must be consistent with offline results).
* Complex usage patterns and scenarios:
    * Multi-tenancy.
    * Varied usage: GNN-only, GNN + search/recommendation/advertising or multi-modal models.
    * Heterogeneous resources: CPU/GPU clusters.

AGL addresses these challenges with the following approaches:

* Graph scale:
    * AGL removes the data-dependency problem by transforming the large graph into small subgraphs in advance.
* Scalability:
    * Graph sampling: conditional filtering (indexes) + sampling (random/probabilistic, TopK).
    * Graph representation: a graph-feature format that can express homogeneous, heterogeneous, and dynamic graphs; supports node-, edge-, and graph-level subgraphs; and can optionally store only the graph structure (see the schema example after this list).
    * Graph training: with the data dependencies removed, mature DNN training architectures such as PS (Parameter Server) and AllReduce can be reused for efficient, large-scale distributed training.
* Stability:
    * Reuse mature Spark or MapReduce (for the graph-sampling phase) and the elasticity and fault tolerance of existing DNN infrastructure.
* Consistency:
    * Sample consistency: graph samples generated offline can be reused for online/offline prediction.
* Resource cost:
    * Graph features can be stored on disk, reducing memory requirements.
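For a concrete sense of what a graph-feature schema looks like, the `--subgraph_spec` passed to the sampling job in the kcan_movielens example later in this commit declares one default node type and one default edge type, each carrying a 1-dimensional dense int64 feature. The sketch below renders that same spec as a Python dict purely for readability; the content is copied from the submit script, only the layout is new:

```python
# subgraph_spec from agl/python/examples/kcan_movielens (this commit),
# reformatted from its single-line JSON form for readability.
subgraph_spec = {
    "node_spec": [
        {
            "node_name": "default",
            "id_type": "string",
            "features": [
                {"name": "node_feature", "type": "dense", "dim": 1, "value": "int64"}
            ],
        }
    ],
    "edge_spec": [
        {
            "edge_name": "default",
            "n1_name": "default",
            "n2_name": "default",
            "id_type": "string",
            "features": [
                {"name": "edge_feature", "type": "dense", "dim": 1, "value": "int64"}
            ],
        }
    ],
}
```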

Based on these considerations, AGL has developed comprehensive solutions for graph data construction and learning,
enabling the completion of large-scale graph learning tasks on regular machines or clusters:

* Graph sampling:
    * AGL uses Spark (or MapReduce) to pre-extract the k-hop neighborhood of each target node as its graph feature.
* Graph training:
    * During training, AGL provides parsing logic that converts graph features into the inputs the model needs, such as the adjacency matrix, node feature matrix, and edge feature matrix. Plugging graph learning into the ordinary DNN training mode in this way makes it easy to reuse the mature technologies and infrastructure of standard DNN workflows; a minimal sketch of this pattern follows.
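Concretely, in plain PyTorch the pattern looks like this. The sketch deliberately avoids the pyagl API: the stand-in class and function below play the roles that AGLTorchMapBasedDataset and AGLHomoCollateForPyG play elsewhere in this commit, and their names, arguments, and tensor shapes are illustrative assumptions, not AGL's actual signatures:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class GraphFeatureDataset(Dataset):
    """Stand-in for AGL's map-based dataset: each item is one serialized
    GraphFeature (e.g., a pb string) together with its label."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]


def collate_graph_features(batch):
    """Stand-in for the collate step: decode each GraphFeature into the
    tensors a GNN consumes. Real parsing lives in pyagl's C++ decoder;
    the placeholder tensors here only illustrate the outputs."""
    labels = torch.tensor([label for _, label in batch])
    edge_index = torch.zeros((2, 0), dtype=torch.long)  # adjacency as COO pairs
    node_feats = torch.zeros((0, 8))                    # node feature matrix
    return edge_index, node_feats, labels


records = [(b"graph_feature_pb", 1), (b"graph_feature_pb", 0)]
loader = DataLoader(
    GraphFeatureDataset(records), batch_size=2, collate_fn=collate_graph_features
)
for edge_index, node_feats, labels in loader:
    pass  # from here on it is an ordinary PyTorch/PyG training loop
```

The design point is that everything graph-specific is confined to the collate step; the surrounding loop, and therefore the distributed training and serving infrastructure, stays standard.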

AGL currently uses PyTorch as its backend and integrates open-source algorithm libraries such as PyG to reduce development effort. For complex graph data in homogeneous, heterogeneous, and dynamic forms, AGL also provides in-house graph algorithms for node classification, edge prediction, and representation learning.

# How to use

* [Installation Guide](doc/core/English/install_EN.md)
* [Process Workflow](doc/core/English/process_description_EN.md)
* [Generate Graph Samples](doc/core/English/sampler/0_data_preparation_EN.md)
* [Graph Learning Tutorial](doc/core/English/graph_learning_tutorial_EN.md)

# How to Contribute

* [Contribution Guidelines](doc/core/English/contribution_EN.md)

# Cite

92 changes: 92 additions & 0 deletions README_CN.md
@@ -0,0 +1,92 @@
# Ant Graph Learning

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](./LICENSE)

Ant Graph Learning (AGL) provides an end-to-end solution for industrial-scale graph learning tasks.

[//]: # (<div align="center">)

[//]: # (<img src=./doc/core/architecture.png>)

[//]: # (<br>)

[//]: # (<b>Figure</b>: AGL Overall Architecture)

[//]: # (</div>)

![](doc/core/Chinese/images/architecture.png)

Industrial-scale graph learning tasks have the following characteristics:

* Complex graph data:
    * Large scale: typically billions of nodes, tens of billions of edges, and hundreds of millions of samples.
    * Data dependencies: a node's embedding depends on the embeddings of its neighboring nodes/edges.
    * Rich types: homogeneous/heterogeneous/dynamic graphs.
* Complex task types:
    * Offline: offline training, offline batch prediction, offline full-graph prediction.
    * Online: online training, online prediction (which must be consistent with offline results).
* Complex usage patterns and scenarios:
    * Multi-tenancy.
    * Varied usage: GNN-only, GNN + search/recommendation/advertising or multi-modal models.
    * Heterogeneous resources: CPU/GPU clusters.

AGL's approach to these challenges:

* Graph scale:
    * Graph training: convert the large graph into small subgraphs before training, removing data dependencies.
* Scalability:
    * Graph sampling: conditional filtering (indexes) + sampling (random/probabilistic, TopK).
    * Graph representation: a graph-feature format that expresses homogeneous/heterogeneous/dynamic graphs; supports node/edge/graph-level subgraphs; and can optionally store only the graph structure.
    * Graph training: with data dependencies removed, mature DNN training architectures (e.g., PS, AllReduce) can be reused for large-scale distributed training.
* Stability:
    * Reuse mature Spark or MapReduce (for the graph-sample phase) and the elasticity and fault tolerance of DNN pipeline infrastructure.
* Consistency:
    * Sample consistency: graph samples are generated offline and can be reused for online/offline prediction.
* Resource cost:
    * Graph features can be stored on disk, reducing memory requirements.

Based on these considerations, AGL designed graph-data construction and learning solutions that can complete large-scale graph learning tasks on ordinary clusters:

- Graph samples: AGL uses Spark (or MR) to pre-extract the k-hop neighborhood of each target node as its GraphFeature.

- Graph training: the training phase provides parsing logic that converts GraphFeature into the adjacency matrix, node feature matrix, edge feature matrix, and other inputs the model needs. Seamlessly plugging graph learning into the ordinary DNN training mode in this way makes it easy to reuse that mode's mature technologies and infrastructure.

AGL currently uses PyTorch as its backend and integrates the open-source algorithm library PyG to reduce development effort. For complex graph data (homogeneous/heterogeneous/dynamic graphs), AGL also provides a rich set of in-house graph algorithms (node classification, link prediction, representation learning, etc.).

# How to use

* [Installation Guide](doc/core/Chinese/install.md)
* [Process Workflow](doc/core/Chinese/process_description.md)
* [Generate Graph Samples](doc/core/Chinese/sampler/0_data_preparation.md)
* [Graph Learning Tutorial](doc/core/Chinese/graph_learning_tutorial.md)

# How to Contribute

* [Contribution Guidelines](doc/core/Chinese/contribution.md)

# Cite

```
@article{zhang13agl,
title={AGL: A Scalable System for Industrial-purpose Graph Machine Learning},
author={Zhang, Dalong and Huang, Xin and Liu, Ziqi and Zhou, Jun and Hu, Zhiyang and Song, Xianzheng and Ge, Zhibang and Wang, Lin and Zhang, Zhiqiang and Qi, Yuan},
journal={Proceedings of the VLDB Endowment},
volume={13},
number={12},
year={2020}
}
@inproceedings{zhang2023inferturbo,
title={InferTurbo: A Scalable System for Boosting Full-graph Inference of Graph Neural Network over Huge Graphs},
author={Zhang, Dalong and Song, Xianzheng and Hu, Zhiyang and Li, Yang and Tao, Miao and Hu, Binbin and Wang, Lin and Zhang, Zhiqiang and Zhou, Jun},
booktitle={2023 IEEE 39th International Conference on Data Engineering (ICDE)},
pages={3235--3247},
year={2023},
organization={IEEE Computer Society}
}
```

# License

[Apache License 2.0](LICENSE)
2 changes: 1 addition & 1 deletion agl/python/data/agl_dtype.py
@@ -16,7 +16,7 @@

import numpy as np

from pyagl.pyagl import AGLDType
from pyagl import AGLDType

DTypeValue = namedtuple("DTypeValue", ["name", "np_dtype", "c_dtype"])

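Most of the remaining file diffs in this commit make the same one-line change: importing the bindings from the package root (`from pyagl import ...`) instead of the nested module (`from pyagl.pyagl import ...`). For downstream code that has to run against both layouts, a small compatibility sketch (assuming only that the two import paths shown in these diffs exist on their respective versions):

```python
# Compatibility sketch: try the post-commit package-root layout first,
# then fall back to the older nested-module layout. The symbols are ones
# imported elsewhere in this commit.
try:
    from pyagl import AGLDType, NodeSpec, EdgeSpec
except ImportError:
    from pyagl.pyagl import AGLDType, NodeSpec, EdgeSpec
```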
2 changes: 1 addition & 1 deletion agl/python/data/collate.py
@@ -13,7 +13,7 @@
import torch
from typing import List

from pyagl.pyagl import (
from pyagl import (
NodeSpec,
EdgeSpec,
)
2 changes: 1 addition & 1 deletion agl/python/data/collate_test.py
@@ -5,7 +5,7 @@
import os
import numpy as np

from pyagl.pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from agl.python.data.subgraph.pyg_inputs import TorchSubGraphBatchData
4 changes: 2 additions & 2 deletions agl/python/data/column.py
@@ -93,7 +93,7 @@ def _c_decode(self, data, **kwargs):
if isinstance(data[0], bytes):
# if it is instance of bytes (encoded by utf-8). call multi_dense_decode_bytes
# (implemented with c++) and pass those data to c++ in a zero copy way
from pyagl.pyagl import multi_dense_decode_bytes
from pyagl import multi_dense_decode_bytes

data_bytesarray = [bytearray(data_t) for data_t in data]
res = multi_dense_decode_bytes(
@@ -107,7 +107,7 @@
res_np_array_list = [np.array(res_i) for res_i in res]
elif isinstance(data[0], str):
# if data is instance of str, passing it from Python to C++ using pybind11 will trigger a copy.
from pyagl.pyagl import multi_dense_decode_string
from pyagl import multi_dense_decode_string

res = multi_dense_decode_string(
data,
2 changes: 1 addition & 1 deletion agl/python/data/multi_graph_feature_collate.py
@@ -12,7 +12,7 @@

from typing import List, Union, Callable, Optional, Any, Dict

from pyagl.pyagl import (
from pyagl import (
NodeSpec,
EdgeSpec,
)
2 changes: 1 addition & 1 deletion agl/python/data/subgraph/subgraph.py
@@ -4,7 +4,7 @@
from typing import List
import numpy as np

from pyagl.pyagl import (
from pyagl import (
AGLDType,
DenseFeatureSpec,
SparseKVSpec,
2 changes: 1 addition & 1 deletion agl/python/data/subgraph/subgraph_test.py
@@ -6,7 +6,7 @@
import numpy as np

from agl.python.data.subgraph.subgraph import PySubGraph
from pyagl.pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, DenseFeatureSpec, SparseKVSpec, NodeSpec, EdgeSpec


class SubGraphTest(unittest.TestCase):
2 changes: 1 addition & 1 deletion agl/python/dataset/dataset_collate_test.py
@@ -11,7 +11,7 @@
from agl.python.dataset.map_based_dataset import AGLTorchMapBasedDataset
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from pyagl.pyagl import AGLDType, SparseKVSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, SparseKVSpec, NodeSpec, EdgeSpec


class DatasetAndCollateFnTest(unittest.TestCase):
2 changes: 1 addition & 1 deletion agl/python/examples/drgst/drgst_citeseer.py
@@ -9,7 +9,7 @@
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from agl.python.model.encoder.drgst import DRGSTEncoder
from pyagl.pyagl import (
from pyagl import (
AGLDType,
DenseFeatureSpec,
SparseKVSpec,
2 changes: 1 addition & 1 deletion agl/python/examples/geniepath_ppi/train_geniepath_ppi.py
@@ -9,7 +9,7 @@
from agl.python.dataset.map_based_dataset import AGLTorchMapBasedDataset
from agl.python.data.collate import AGLHomoCollateForPyG
from agl.python.data.column import AGLRowColumn, AGLMultiDenseColumn
from pyagl.pyagl import (
from pyagl import (
AGLDType,
DenseFeatureSpec,
NodeSpec,
2 changes: 1 addition & 1 deletion agl/python/examples/hegnn_acm/model_hegnn.py
@@ -9,7 +9,7 @@
import torch.nn.functional as F

from agl.python.data.column import AGLDenseColumn, AGLRowColumn
from pyagl.pyagl import (
from pyagl import (
AGLDType,
SparseKVSpec,
NodeSpec,
1 change: 0 additions & 1 deletion agl/python/examples/kcan_movielens/data_process/submit.sh
@@ -2,7 +2,6 @@ base=`dirname "$0"`
cd "$base"

python ../../run_spark.py \
--mode yarn \
--jar_resource_path ../../../../java/target/flatv3-1.0-SNAPSHOT.jar \
--input_edge_table_name ./edge_table.csv \
--input_label_table_name ./link_table.csv \
2 changes: 1 addition & 1 deletion agl/python/examples/kcan_movielens/kcan_subgraph_adj.py
@@ -17,7 +17,7 @@
)
from agl.python.data.column import AGLRowColumn, AGLMultiDenseColumn
from agl.python.model.encoder.kcan import KCANEncoder
from pyagl.pyagl import AGLDType, DenseFeatureSpec, NodeSpec, EdgeSpec
from pyagl import AGLDType, DenseFeatureSpec, NodeSpec, EdgeSpec


def delete_root_index(subgraph: TorchSubGraphBatchData):
6 changes: 3 additions & 3 deletions agl/python/examples/kcan_movielens/readme.md
@@ -23,21 +23,21 @@

First, compress the raw data into subgraph (pb string) form using the data_process/submit.sh command below.

Because link mode produces a huge number of samples, users need to set up a Spark cluster to run the job. Users who cannot set up a cluster can download the pre-sampled subgraph data part-subgraph_kcan_train_test.csv from the link above and place it in the data_process/output_graph_feature directory.
Because link mode produces a huge number of samples, running the Spark job requires a machine with 100 GB of memory: modify [start_docker_with_image.sh](../../../../docker/start_docker_with_image.sh) to allocate 100 GB to the container, and set spark.executor.memory=90g and spark.driver.memory=90g in [run_spark_template.sh](../run_spark_template.sh).
Users without such resources can download the pre-sampled subgraph data part-subgraph_kcan_train_test.csv from the link above and place it in the data_process/output_graph_feature directory.

```
base=`dirname "$0"`
cd "$base"
python ../../run_spark.py \
--mode yarn \
--jar_resource_path ../../../../java/target/flatv3-1.0-SNAPSHOT.jar \
--input_edge_table_name ./edge_table.csv \
--input_label_table_name ./link_table.csv \
--input_node_table_name ./node_table.csv \
--output_table_name_prefix ./output_graph_feature \
--neighbor_distance 2 \
--sample_condition 'random_sampler(limit=20, seed=34, replacement=false)' \
--sample_condition 'random_sampler(limit=20, seed=34, replacement=false)' \
--subgraph_spec "{'node_spec':[{'node_name':'default','id_type':'string','features':[{'name':'node_feature','type':'dense','dim':1,'value':'int64'}]}],'edge_spec':[{'edge_name':'default','n1_name':'default','n2_name':'default','id_type':'string','features':[{'name':'edge_feature','type':'dense','dim':1,'value':'int64'}]}]}" \
--algorithm kcan
```
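The two memory settings called out above are ordinary Spark properties. As a hypothetical illustration (not how this repo's scripts work), if you were driving Spark from Python yourself rather than through run_spark.py, the equivalent configuration with the standard pyspark API would be:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Mirrors the run_spark_template.sh guidance above (~90g each on a 100 GB machine).
# Note: spark.driver.memory only takes effect if set before the driver JVM starts
# (e.g., via spark-submit --conf), which is why the template script sets it there
# rather than in application code.
conf = (
    SparkConf()
    .set("spark.executor.memory", "90g")
    .set("spark.driver.memory", "90g")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```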