diff --git a/docs/en/tutorial/index.rst b/docs/en/tutorial/index.rst
index fbbd84eda26..4815ca95cb2 100644
--- a/docs/en/tutorial/index.rst
+++ b/docs/en/tutorial/index.rst
@@ -5,12 +5,14 @@ Tutorials
 .. toctree::
    :maxdepth: 1
 
-   standalone_vs_cluster
-   modes
+   data_import_guide
    tutorial_sql_1
    tutorial_sql_2
-   data_import
    openmldbspark_distribution
+   data_import
+   data_export
    autofe
-   common_architecture
+   standalone_vs_cluster
+   standalone_use
+   app_arch
    online_offline_sync
diff --git a/docs/en/tutorial/tutorial_sql_1.md b/docs/en/tutorial/tutorial_sql_1.md
index b6bcd5530e8..c94df66c086 100644
--- a/docs/en/tutorial/tutorial_sql_1.md
+++ b/docs/en/tutorial/tutorial_sql_1.md
@@ -1,7 +1,7 @@
 # SQL for Feature Extraction (Part 1)
 
-## 1. The Feature Engineering of Machine Learning
+## 1. Feature Engineering for Machine Learning
 
 A real-world machine learning application generally includes two main processes, namely **Feature Engineering** and **Machine Learning Model** (hereinafter referred to as **Model**). We must know a lot about the model, from the classic logistic regression and decision tree models to the deep learning models, we all focus on how to develop high-quality models. We may pay less attention to feature engineering. However, as the saying goes, data and features determine the upper limit of machine learning, while models and algorithms only approach this limit. It can be seen that we have long agreed on the importance of Feature Engineering.
 
@@ -59,7 +59,7 @@ For example, the following user transaction table (hereinafter referred as data
 | trans_type | STRING | Transaction Type |
 | province | STRING | Province |
 | city | STRING | City |
-| label | BOOL | Sample label, true\|false |
+| label | BOOL | Sample label, `true` or `false` |
 
 In addition to the primary table, there may also be tables storing relevant auxiliary information in the database, which can be combined with the primary table through the JOIN operation.
These tables are called **Secondary Tables** (note that there may be multiple secondary tables). For example, we can have a secondary table storing the merchants' history flow. In the process of feature engineering, more valuable information can be obtained by combining the primary and secondary tables. The feature engineering over multiple tables will be introduced in detail in the [next part](tutorial_sql_2.md) of this series.
@@ -143,6 +143,7 @@ Important parameters include:
 - The lower bound time must be `>=` the upper bound time.
 - The lower bound row must follow the upper bound row.
 
+For more features, please refer to the [documentation](../openmldb_sql/dql/WHERE_CLAUSE.md).
 
 #### Example
 
@@ -150,9 +151,9 @@ For the transaction table T1 shown above, we define two `ROWS_RANGE` windows and
 
 ![img](images/table_t1.png)
 
-Note that the following window definitions are not completed SQL. We will add aggregate functions later to complete runnable SQL.
+Note that the following window definitions are not complete SQL statements. We will add aggregate functions later to make them runnable (see [3.3.2](332-step-2constructfeaturesbasedontimewindow)).
 
-- w1d: the window within the most recent day
+**w1d: the window within the most recent day**
 The window of the user's most recent day containing the rows from the current to the most recent day
 ```sql
 window w1d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
@@ -160,14 +161,14 @@ window w1d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 1d PRECED
 
 The `w1d` window shown in the above figure is for the partition `id=9`, and the `w1d` window contains three rows (`id=6`, `id=8`, `id=9`). These three rows fall in the time window [2022-02-07 12:00:00, 2022-02-08 12:00:00] .
 
-- w1d_10d: the window from 1 day ago to the last 10 days
+**w1d_10d: the window from 1 day ago to the last 10 days**
 ```sql
 window w1d_10d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 10d PRECEDING AND 1d PRECEDING)
 ```
 
 The window `w1d_10d` for the partition `id=9` contains three rows, which are `id=1`, `id=3` and `id=4`. These three rows fall in the time window of [2022-01-29 12:00:00, 2022-02-07 12:00:00]。
 
-- w0_1: the window contains the last 0 ~ 1 rows
+**w0_1: the window contains the last 0 ~ 1 rows**
 The window contains the last 0 ~ 1 rows, including the previous line and the current line.
 ```sql
 window w0_1 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
@@ -175,7 +176,7 @@ window w0_1 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 1 PRECEDING AN
 
 The window `w0_1` for the partition `id=10` contains 2 rows, which are `id=7` and `id=10`.
 
-- w2_10: the window contains the last 2 ~ 10 rows
+**w2_10: the window contains the last 2 ~ 10 rows**
 
 ```sql
 window w2_10 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING)
@@ -304,7 +305,7 @@ window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PREC
 
 We make frequency statistics for a given column as we may need to know the type of the highest frequency, the proportion of the type with the largest number, etc., in each category.
 
-`top1_ratio`: Find out the type with the largest number and compute the proportion of its number in the window.
+**`top1_ratio`**: Find out the type with the largest number and compute the proportion of its number in the window.
 
 The following SQL uses `top1_ratio` to find out the city with the most transactions in the last 30 days and compute the proportion of the number of transactions of the city to the total number of transactions in t1.
 ```sql
@@ -314,7 +315,7 @@ FROM t1
 window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PRECEDING AND CURRENT ROW);
 ```
 
-`topn_frequency(col, top_n)`: Find the `top_n` categories with the highest frequency in the window
+**`topn_frequency(col, top_n)`**: Find the `top_n` categories with the highest frequency in the window
 
 The following SQL uses `topn_frequency` to find out the top 2 cities with the highest number of transactions in the last 30 days in t1.
 ```sql
diff --git a/docs/en/tutorial/tutorial_sql_2.md b/docs/en/tutorial/tutorial_sql_2.md
index bb69147c065..cc7ab8261ad 100644
--- a/docs/en/tutorial/tutorial_sql_2.md
+++ b/docs/en/tutorial/tutorial_sql_2.md
@@ -63,7 +63,7 @@ As shown below, left table `LAST JOIN` right table with `ORDER BY` and right tab
 
 ## 3. Multi-Row Aggregation over Multiple Tables
 
-For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../reference/sql/dql/WINDOW_CLAUSE.md#window-union) syntax.
+For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) syntax.
 
 WINDOW UNION supports combining multiple pieces of data from the secondary table to form a window on secondary table. Based on the time window, it is convenient to construct the multi-row aggregation feature of the secondary table. Similarly, two steps need to be completed to construct the multi-row aggregation feature of the secondary table:
 
@@ -122,10 +122,10 @@ Among them, necessary elements include:
 - Lower bound time must be > = Upper bound time
 - The row number of lower bound must be < = The row number of upper bound
 - `INSTANCE_NOT_IN_WINDOW`: It indicates that except for the current row, other data in the main table will not enter the window.
-- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../reference/sql/dql/WINDOW_CLAUSE.md).
+- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../openmldb_sql/dql/WINDOW_CLAUSE.md).
 ```
 
-### Example
+#### Example
 
 Let's see the usage of WINDOW UNION through specific examples.
 
@@ -166,7 +166,7 @@ PARTITION BY mid ORDER BY purchase_time
 ROWS_RANGE BETWEEN 10d PRECEDING AND 1 PRECEDING INSTANCE_NOT_IN_WINDOW)
 ```
 
-## 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
+### 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
 
 Apply the multi-row aggregation function on the created window to construct aggregation features on multi-rows of secondary table, so that the number of rows finally generated is the same as that of the main table. For example, we can construct features from the secondary table like: the total retail sales of merchants in the last 10 days `w10d_merchant_purchase_amt_sum` and the total consumption times of the merchant in the last 10 days `w10d_merchant_purchase_count`.
 
diff --git a/docs/zh/tutorial/tutorial_sql_1.md b/docs/zh/tutorial/tutorial_sql_1.md
index aa73927ace7..bbe618bf384 100644
--- a/docs/zh/tutorial/tutorial_sql_1.md
+++ b/docs/zh/tutorial/tutorial_sql_1.md
@@ -144,6 +144,7 @@ window window_name as (PARTITION BY partition_col ORDER BY order_col ROWS_RANGE
 - OpenMLDB 的下界条数必须<=上界条数
 
 更多语法和特性可以参考 [OpenMLDB窗口参考手册](../openmldb_sql/dql/WHERE_CLAUSE.md)。
+
 #### 示例
 对于上面所示的交易表 t1,我们定义两个时间窗口和两个条数窗口。每一个样本行的窗口均按用户ID(`uid`)分组,按交易时间(`trans_time`)排序。下图展示了分组排序后的数据。
 ![img](images/table_t1.jpg)
@@ -240,7 +241,7 @@ xxx_cate(col, cate) over w
 - 参数`col`:参与聚合计算的列。
 - 参数`cate`:分组列。
 
-目前支持的带有 _cate 后缀的聚合函为:`count_cate`, `sum_cate`, `avg_cate`, `max_cate`, `min_cate`
+目前支持的带有 `_cate` 后缀的聚合函为:`count_cate`, `sum_cate`, `avg_cate`, `max_cate`, `min_cate`
 
 相关示例如下:
 
diff --git a/docs/zh/tutorial/tutorial_sql_2.md b/docs/zh/tutorial/tutorial_sql_2.md
index 6e1658ad228..913b10a161d 100644
--- a/docs/zh/tutorial/tutorial_sql_2.md
+++ b/docs/zh/tutorial/tutorial_sql_2.md
@@ -64,7 +64,7 @@ SELECT * FROM s1
LAST JOIN s2 ORDER BY s2.std_ts ON s1.col1 = s2.col1;
 
 ## 3. 副表多行聚合特征
 
-OpenMLDB 针对副表拼接场景,扩展了标准的 WINDOW 语法,新增了 [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#windowunion) 的特性,支持从副表拼接多条数据形成副表窗口。在副表拼接窗口的基础上,可以方便构建副表多行聚合特征。同样地,构造副表多行聚合特征也需要完成两个步骤:
+OpenMLDB 针对副表拼接场景,扩展了标准的 WINDOW 语法,新增了 [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) 的特性,支持从副表拼接多条数据形成副表窗口。在副表拼接窗口的基础上,可以方便构建副表多行聚合特征。同样地,构造副表多行聚合特征也需要完成两个步骤:
 
 - 步骤一:定义副表拼接窗口。
 - 步骤二:在副表拼接窗口上构造副表多行聚合特征。