TuGraph-family · stevenHust · Sep 5, 2023 · Sep 1, 2023 · Sep 1, 2023 · Sep 1, 2023
diff --git a/agl/java/pom.xml b/agl/java/pom.xml
@@ -139,10 +139,6 @@
                             <pattern>org.apache.commons.cli</pattern>
                             <shadedPattern>agl.apache.commons.cli</shadedPattern>
                         </relocation>
-                        <relocation>
-                            <pattern>org.apache.hadoop</pattern>
-                            <shadedPattern>agl.apache.hadoop</shadedPattern>
-                        </relocation>
                         <relocation>
                             <pattern>javax.servlet</pattern>
                             <shadedPattern>agl.servlet</shadedPattern>

diff --git a/doc/core/sampler/0_data_preparation.md b/doc/core/sampler/0_data_preparation.md
@@ -8,7 +8,7 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
 
 Node and edge types are both described as Strings.
 
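-Nodes and edges can carry features, all described as Strings. There are four kinds of features: Dense, SparseKV, SparseK, and Binary. Their formats are as follows:
+Nodes and edges can carry features, all described as Strings. There are four kinds of features: Dense, SparseKV, SparseK, and Binary (planned as a later extension). Their formats are as follows: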
 
 | Feature type | Configuration options     | Example                                               |
 | ----------- | ------------------------- | ----------------------------------------------------- |
@@ -96,7 +96,6 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
      }
   ],
   'label': {
-     'type': 'node',
      'attr': [
        {
          'field': 'time',
@@ -122,16 +121,15 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
 #### features
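 Each element of the list defines the format of one kind of feature:
   * name: the feature name
-  * type: the feature type; one of dense, sparse_kv, sparse_k, binary
-  * dim: for dense, the number of elements; for sparse_k and sparse_kv, the maximum key value + 1
-  * allowed dtype for key: int64
-  * allowed dtype for value: float32, float64, int64
+  * type: the feature type; one of dense, sparse_kv, sparse_k (binary is planned as a later extension)
+  * dim: for dense, the number of elements; for sparse_kv and sparse_k, the maximum key value + 1
+  * allowed dtype for key: int64 (not applicable to dense)
+  * allowed dtype for value: float32, float64, int64 (not applicable to sparse_k)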
 
 #### edge_attr
 Defines the name and type of each edge attribute.
 
 #### label
-Defines the sample type; one of node level, edge level, graph level.
 Defines the name and type of each sample attribute.
 
 #### Node table
@@ -150,14 +148,14 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
  ``` 
 
 #### Edge table
-| node1_id   | node2_id   | edge_feature    |    type   |
-| ---------- | ---------- | --------------- | --------- |
-| user1      | item1      |      0 1 3      |   click   |
-| user2      | item1      |      0 2        |   click   |
-| user3      | item2      |      1          |   click   |
-| user2      | item3      |      2 3        |   click   |
-| user1      | user2      |                 |  friends  |
-| user2      | user1      |                 |  friends  |
+| node1_id   | node2_id   | edge_id | edge_feature    |    type   |
+| ---------- | ---------- | ------- | --------------- | --------- |
+| user1      | item1      |    e1   |      0 1 3      |   click   |
+| user2      | item1      |    e2   |      0 2        |   click   |
+| user3      | item2      |    e3   |      1          |   click   |
+| user2      | item3      |    e4   |      2 3        |   click   |
+| user1      | user2      |    e5   |                 |  friends  |
+| user2      | user1      |    e6   |                 |  friends  |
 
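 ## Sample table
 During training, sample nodes usually make up 10% or less of all nodes in the graph, so training only needs graph samples for labeled nodes (graph samples for all nodes may be needed only at prediction time).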
@@ -173,12 +171,12 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
 | 3       | 3       |   1 0    |     15    |     5     |
  ```
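 Required columns: seed, node_id, label.
-The system takes node_id as the seed, propagates k times over the edge table to obtain the k-hop neighbor subgraph, and outputs it as a new column.
+The system takes node_id as the seed, propagates k times over the edge table to obtain the k-hop neighbor graph_feature, and outputs it as a new column.
 Other columns (other1, other2, ...) are output unchanged.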
  ``` 
 The result table is as follows:
 
-| seed    | node_id | label    |  other1   |  other2   |  subgraph     |
+| seed    | node_id | label    |  other1   |  other2   | graph_feature |
 | ------- | ------- | -------- | --------- | --------- | ------------- |
 | 1       | 1       |   0 1    |     23    |     2     | serialized subgraph of 1 |
 | 3       | 3       |   1 0    |     15    |     5     | serialized subgraph of 3 |
@@ -192,27 +190,28 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
 | l_3_5   | 3        | 5          |   1 0    |     15   |
  ```
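 Required columns: seed, node1_id, node2_id, label.
-The system takes the deduplicated union of node1_id and node2_id as the seed set, propagates k times over the edge table to obtain the k-hop neighbor subgraph, and outputs it as a new column.
+The system takes the deduplicated union of node1_id and node2_id as the seed set, propagates k times over the edge table to obtain the k-hop neighbor graph_feature, and outputs it as a new column.
 Other columns (other1, ...) are output unchanged.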
  ``` 
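-If merge_subgraph=True is configured (e.g. for the KCan model), the result table is as follows:
+If the subgraphs of node1_id and node2_id need to be merged (e.g. for the KCan model), the result table is as follows: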
 
-| seed    | node1_id | node2_id   | label    |  other1   |  subgraph     |
+| seed    | node1_id | node2_id   | label    |  other1   | graph_feature |
 | ------- | -------- | ---------- | -------- | --------- | ------------- |
 | l_1_3   | 1        | 3          |   0 1    |     23    | serialized subgraph of 1 and 3 |
 | l_3_5   | 3        | 5          |   1 0    |     15    | serialized subgraph of 3 and 5 |
+The merged subgraph has 2 root nodes, in the order node1_id, node2_id.
 
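-If merge_subgraph=False is configured (e.g. for the CD-GNN model), the result table is as follows:
+If the subgraphs of node1_id and node2_id do not need to be merged (e.g. for the CD-GNN model), the result table is as follows: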
 
-| seed    | node1_id | node2_id   | label    |  other1   |  subgraph     |  subgraph1   |
+| seed    | node1_id | node2_id   | label    |  other1   | graph_feature | graph_feature_2 |
 | ------- | -------- | ---------- | -------- | --------- | ------------- | ------------ |
 | l_1_3   | 1        | 3          |   0 1    |     23    | serialized subgraph of 1 | serialized subgraph of 3 |
 | l_3_5   | 3        | 5          |   1 0    |     15    | serialized subgraph of 3 | serialized subgraph of 5 |
 
 
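 ## Subgraph-level sampling
 In datasets such as ppi, multiple nodes form one independent small graph. To avoid heavily repeated sampling and computation of each node's k-hop subgraph,
-the k-hop propagation can start from several nodes at once and be merged into a single subgraph, with deduplication along the way.
+the k-hop propagation can start from several nodes at once and be merged into a single graph_feature, with deduplication along the way.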
 
 ### Subgraph-level sampling - node representation
 For graph learning models that work per node, i.e. each node has its own label.
@@ -227,12 +226,12 @@ AGL支持异构属性特征图，图的基本元素为点和边，点和边都
 | 5      | g1         |   0 1    |     10   |
  ```
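 Required columns: node_id, graph_id, label.
-The system takes the node_id nodes as the seed set, propagates k times over the edge table to obtain the k-hop neighbors, and finally merges them by graph_id into a single subgraph, which is output as a new column.
+The system takes the node_id nodes as the seed set, propagates k times over the edge table to obtain the k-hop neighbors, and finally merges them by graph_id into a single graph_feature, which is output as a new column.
 label and the other columns (other1, ...) are grouped into lists by graph_id, keeping the same order as node_id.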
  ``` 
 The result table is as follows:
 
-| node_id   |   seed     | label            |  other1    |   subgraph   | 
+| node_id   |   seed     | label            |  other1    | graph_feature | 
 | --------- | ---------- | ---------------- | ---------- | ------------ |
 | 1 3 5     | g1         | [0 1, 1 0, 0 1]  | [3, 9, 10] | serialized subgraph of g1 |
 | 2 4       | g2         | [1 0, 0 1]       | [23, 7]    | serialized subgraph of g2 |
@@ -248,54 +247,12 @@ label和其他列other1,...，按graph_id组合为列表，同时保持了和nod
 | 2 4 6      | g2         |   1 0    |        15       |
  ```
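 Required columns: node_id, graph_id, label.
-The system takes every node listed in node_id as the seed set, propagates k times over the edge table to obtain the k-hop neighbors, and merges them by graph_id into a single subgraph, which is output as a new column.
+The system takes every node listed in node_id as the seed set, propagates k times over the edge table to obtain the k-hop neighbors, and merges them by graph_id into a single graph_feature, which is output as a new column.
 Other columns (other1, ...) are output unchanged.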
  ``` 
 The result table is as follows:
 
-| node_id   |   seed     | label    |  other_feature  |   subgraph   |
-| --------- | ---------- | -------- | --------------- | ------------ |
-| 1 3 5     | g1         |   0 1    |        23       | serialized subgraph of g1 |
-| 2 4 6     | g2         |   1 0    |        15       | serialized subgraph of g2 |
-
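-## Multi-partition data
-For some models (e.g. ST-GNN), the node and edge tables contain multiple partitions (for example, each week's data is one partition, and several weeks are stored).
-The sample table also contains multiple partitions. The same seed node can carry different labels in different partitions, and its subgraph data is built by multi-hop propagation and aggregation over the graph data of the most recent k partitions (including the current one).
-
-In this example, the input node and edge tables contain 3 partitions: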
-
-|  node_id  |  node_feature   | partition |
-| --------- | --------------- | --------- |
-|     A     |     1.0 1.3     |     1     |
-|     B     |     0.3 0.34    |     1     |
-|     A     |     1.3 0.5     |     2     |
-|     B     |     3.1 6.3     |     2     |
-|     A     |     0.2 0.4     |     3     |
-|     B     |     0.4 1.3     |     3     |
-
-
-
-|  node1_id |  node2_id | edge_feature | partition |
-| --------- | --------- | ------------ | --------- |
-|     A     |     B     |     0 1 3    |     1     |
-|     B     |     A     |     0 2      |     1     |
-|     C     |     B     |     1        |     2     |
-|     B     |     C     |     2 3      |     2     |
-|     A     |     C     |     3        |     3     |
-|     B     |     A     |     2        |     3     |
-
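-The sample table is shown below. The subgraph data for each sample is built by multi-hop propagation and aggregation over the graph data of the most recent 2 partitions: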
-
-|  seed  | node_id |  label   |  other1   | partition |
-| ------ | ------- | -------- | --------- | --------- |
-| A@2    |    A    |   0 1    |     23    |     2     |
-| A@3    |    A    |   0 1    |     13    |     3     |
-| B@2    |    B    |   1 0    |     12    |     2     |
-
-The output result table is as follows:
-
-|  seed  | node_id |  label   |  other1   | partition |  subgraph    |  subgraph1  |
-| ------ | ------- | -------- | --------- | --------- | ------------ | ----------- |
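-| A@2    |    A    |   0 1    |     23    |     2     | subgraph of A in partition 1 | subgraph of A in partition 2 |
-| A@3    |    A    |   0 1    |     13    |     3     | subgraph of A in partition 2 | subgraph of A in partition 3 |
-| B@2    |    B    |   1 0    |     12    |     2     | subgraph of B in partition 1 | subgraph of B in partition 2 |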
+| node_id   |   seed  | label  |  other_feature  | graph_feature |
+| --------- | ------- | ------ | --------------- | ------------ |
+| 1 3 5     | g1      |   0 1  |        23       | serialized subgraph of g1 |
+| 2 4 6     | g2      |   1 0  |        15       | serialized subgraph of g2 |
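
The k-hop propagation that this document describes can be sketched in a few lines of Python. This is only a minimal illustration, not AGL's Spark implementation: the function name `k_hop_subgraph` is hypothetical, and the sketch assumes that propagation walks edges backwards (from `node2_id` to `node1_id`), which the document does not state. The rows reuse the edge table example from the data preparation doc.

```python
from collections import defaultdict

# Rows of the example edge table: (node1_id, node2_id, edge_id).
EDGES = [
    ("user1", "item1", "e1"),
    ("user2", "item1", "e2"),
    ("user3", "item2", "e3"),
    ("user2", "item3", "e4"),
    ("user1", "user2", "e5"),
    ("user2", "user1", "e6"),
]


def k_hop_subgraph(edges, seeds, k):
    """Collect the node ids and edge ids within k hops of the seeds.

    Assumption (not confirmed by the document): a hop goes from
    node2_id back to node1_id, i.e. it gathers incoming neighbors.
    """
    in_edges = defaultdict(list)
    for n1, n2, eid in edges:
        in_edges[n2].append((n1, eid))

    nodes = set(seeds)
    edge_ids = set()
    frontier = set(seeds)
    for _ in range(k):
        next_frontier = set()
        for node in frontier:
            for neighbor, eid in in_edges.get(node, []):
                edge_ids.add(eid)
                # Deduplicate: expand each node at most once.
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return nodes, edge_ids


nodes, edge_ids = k_hop_subgraph(EDGES, ["item1"], k=2)
print(sorted(nodes), sorted(edge_ids))
```

Passing several seeds at once, for example both `node1_id` and `node2_id` of a link sample, yields one deduplicated result, which loosely mirrors the merged case with two root nodes.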
diff --git a/doc/core/sampler/1_demo_of_subgraph_sampling.md b/doc/core/sampler/1_demo_of_subgraph_sampling.md
@@ -1,9 +1,9 @@
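 # Quick start
-The project's example directory contains runnable examples for several graph models. Below we use Geniepath on the PPI dataset to show how to get started with subgraph sampling.
+The project's example directory contains runnable examples for several graph models. Below we use drgst on the ind.citeseer dataset to show how to get started with subgraph sampling.
 ## Graph data preparation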
 
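 ### Graph data format
-In the PPI graph data, node features are Dense Float lists with dim=50, and edges have no features. The graph data format is as follows:
+In the ind.citeseer graph data, node features are SparseKV features, and edges have no features. The graph data format is as follows: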
  ``` 
 {
   'node_spec': [
@@ -12,9 +12,10 @@ PPI图数据点特征为dim=50的Dense Float列表，边上没有特征。图数
       'id_type': 'string',
       'features': [
         {
-          'name': 'node_f',
-          'type': 'dense',
-          'dim': 50,
+          'name': 'sparse_kv',
+          'type': 'kv',
+          'dim': 3703,
+          'key': 'uint32',
           'value': 'float32'
         }
       ]
@@ -26,84 +27,81 @@ PPI图数据点特征为dim=50的Dense Float列表，边上没有特征。图数
       'n1_name': 'default',
       'n2_name': 'default',
       'id_type': 'string',
-      'features': []
+      'features': [
+      ]
     }
   ]
 }
  ``` 
 
 Online JSON formatting tool: http://jsonviewer.stack.hu/
-![](../imgs/json_viewer.png)
+![](../../imgs/json_viewer.png)
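 Its format and remove white space functions are very handy. We recommend single quotes for strings in the JSON to avoid escaping hassles.
 format makes the JSON easier to read and edit; remove white space makes it easier to paste into a config option or code.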
 
-###### TODO:  should settings such as 'node_name': 'default', 'n1_name': 'default' be treated as defaults and not configured explicitly?
-
 ### Input data tables
 Example input node table:
 
-|  node_id   |    node_feature     |
-| ---------- | ------------------- |
-|     1      |   1.0 2.3 ... 1.3   |
-|     2      |   0.3 3.2 ... 0.34  |
-|     3      |   1.3 0.9 ... 0.5   |
-|     4      |   3.1 7.4 ... 6.3   |
+|  node_id   |                        node_feature                        |
+| ---------- | ---------------------------------------------------------- |
+|     0      |   184:0.032258063554763794 ... 3647:0.032258063554763794   |
+|     1      |   82:0.03030303120613098 ... 3640:0.03030303120613098      |
+|     2      |   44:0.03999999910593033 ... 3644:0.03999999910593033      |
 
 Example input edge table:
 
-|  node1_id  |  node2_id  | edge_feature |
-| ---------- | ---------- | ------------ |
-|     1      |     0      |              |
-|     0      |     1      |              |
-|     1      |     3      |              |
-|     2      |     10     |              |
+|  node1_id  |  node2_id  | edge_id |
+| ---------- | ---------- | ------- |
+|    628     |     0      |  628_0  |
+|    158     |     1      |  158_1  |
+|    486     |     1      |  486_1  |
 
 The sample table is as follows:
 
-|  seed  |  node_id  |    label   |  train_flag   |
-| ------ | --------- | ---------- | ------------- |
-|    0   |     0     |  0 0 ... 1 |     train     |
-|    2   |     2     |  1 1 ... 0 |     eval      |
-|    5   |     5     |  1 0 ... 0 |     test      |
+|  node_id  |  seed  |    label     | train_flag  |
+| --------- | ------ | ------------ | ----------- |
+|    0      |    0   |  0 0 0 1 0 0 |    train    |
+|    1      |    1   |  0 1 0 0 0 0 |    eval     |
+|    2      |    2   |  0 0 0 0 0 1 |    test     |
 
 ## Running Spark to generate subgraph samples
 
 The command for running Spark locally is shown below (only Spark 3.0.3 and later are currently supported):
  ``` 
-/path_to/spark-3.1.1-odps0.34.1/bin/spark-submit  --master local --class com.alipay.alps.flatv3.spark.NodeLevelSampling \
+spark-submit  --master local --class com.alipay.alps.flatv3.spark.NodeLevelSampling \
     /path_to/agl.jar hop=2 \
-    subgraph_spec="{'node_spec':[{'node_name':'default','id_type':'string','features':[{'name':'node_f','type':'dense','dim':50,'value':'float32'}]}],'edge_spec':[{'edge_name':'default','n1_name':'default','n2_name':'default','id_type':'string','features':[]}]}"  \
-    sample_cond='random_sampler(limit=10, replacement=false)'   \
-    input_node_feature="file:////path_to/ppi_node_table.csv" \
-    input_edge="file:////path_to/ppi_edge_table.csv" \
-    input_label="file:////path_to/ppi_label.csv" \
+    subgraph_spec="{'node_spec':[{'node_name':'default','id_type':'string','features':[{'name':'sparse_kv','type':'kv','dim':3703,'key':'uint32','value':'float32'}]}],'edge_spec':[{'edge_name':'default','n1_name':'default','n2_name':'default','id_type':'string','features':[]}]}"  \
+    sample_cond='random_sampler(limit=100, replacement=false)'   \
+    input_node_feature="file:////path_to/node_table.csv" \
+    input_edge="file:////path_to/edge_table.csv" \
+    input_label="file:////path_to/label.csv" \
     output_results='file:////path_to/output_subgraph' 2>&1 | tee logfile.txt
  ``` 
 
 
 ### Configuration
 
 |                           Option                          |            Description            |
-| -------------------------------------------------------- | --------------------------------- |
-|                      --master local                      |        Spark local run mode        |
-|  --class com.alipay.alps.flatv3.spark.NodeLevelSampling                |      Spark program entry point: subgraph sampling         |
-|                           hop=2                          |            sample 2-hop neighbors          |
-|                       subgraph_spec                      |            defines the graph data format           |
-|sample_cond="random_sampler(limit=10, replacement=false)" | sample at most 10 neighbors per node, without replacement |
-| input_node_feature="file:////path_to/ppi_node_table.csv" |    the file:/// prefix means a local path follows   |
+| --------------------------------------------------------- | --------------------------------- |
+|                      --master local                       |        Spark local run mode        |
+|  --class com.alipay.alps.flatv3.spark.NodeLevelSampling   |      Spark program entry point: subgraph sampling      |
+|                           hop=2                           |            sample 2-hop neighbors          |
+|                       subgraph_spec                       |            defines the graph data format           |
+|sample_cond="random_sampler(limit=100, replacement=false)" | sample at most 100 neighbors per node, without replacement |
+| input_node_feature="file:////path_to/node_table.csv"      |    the file:/// prefix means a local path follows   |
 
 ### Overall graph sampling flow
 The figure below shows how 2-hop subgraph structure sampling expands:
 
-![](../imgs/join_graph_structure.png)
+![](../../imgs/join_graph_structure.png)
 
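 Once the subgraph structure (the node and edge information each root node depends on) is obtained, the node and edge features are joined in to produce the subgraph samples.
 ### Result data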
 
 The output result table is as follows:
 
-|  node_id   |    label   |  train_flag  |  subgraph  |
-| ---------- | ---------- | ------------ | ---------- |
-|     0      |  0 0 ... 1 |     train    |   subgraph of 0   |
-|     2      |  1 1 ... 0 |     eval     |   subgraph of 2   |
-|     5      |  1 0 ... 0 |     test     |   subgraph of 5   |
+|  node_id   |     label    |  train_flag  |  subgraph  |
+| ---------- | ------------ | ------------ | ---------- |
+|     0      |  0 0 0 1 0 0 |     train    |   subgraph of 0  |
+|     1      |  0 1 0 0 0 0 |     eval     |   subgraph of 1  |
+|     2      |  0 0 0 0 0 1 |     test     |   subgraph of 2  |
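
The `sample_cond` option in the command above caps how many neighbors are kept per node. A rough Python sketch of "at most `limit` neighbors, without replacement" is shown here. `sample_neighbors` is a hypothetical helper for illustration only; it makes no claim about how AGL's `random_sampler` is implemented.

```python
import random


def sample_neighbors(neighbors, limit, rng=None):
    """Return at most `limit` distinct neighbors, sampled without replacement.

    `rng` is an optional random.Random instance, useful for reproducible runs.
    """
    rng = rng or random.Random()
    neighbors = list(neighbors)
    if len(neighbors) <= limit:
        # Fewer candidates than the limit: keep all of them.
        return neighbors
    return rng.sample(neighbors, limit)


# A node with 500 candidate neighbors keeps only 100 of them.
picked = sample_neighbors(range(500), limit=100, rng=random.Random(0))
print(len(picked), len(set(picked)))
```

With `hop=2`, a cap like this would apply at each hop, so one seed could pull in up to `limit` first-hop neighbors and up to `limit` more for each of those.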