docs: updates on tutorial folder (#3754)
* update on tutorial folder

* Update tutorial_sql_2.md

* update for links
Elliezza authored Feb 22, 2024
1 parent 9f0d3fc commit 15de293
Showing 5 changed files with 23 additions and 19 deletions.
10 changes: 6 additions & 4 deletions docs/en/tutorial/index.rst
@@ -5,12 +5,14 @@ Tutorials
.. toctree::
:maxdepth: 1

standalone_vs_cluster
modes
data_import_guide
tutorial_sql_1
tutorial_sql_2
data_import
openmldbspark_distribution
data_import
data_export
autofe
common_architecture
standalone_vs_cluster
standalone_use
app_arch
online_offline_sync
19 changes: 10 additions & 9 deletions docs/en/tutorial/tutorial_sql_1.md
@@ -1,7 +1,7 @@
# SQL for Feature Extraction (Part 1)


## 1. The Feature Engineering of Machine Learning
## 1. Feature Engineering for Machine Learning

A real-world machine learning application generally includes two main processes, namely **Feature Engineering** and **Machine Learning Model** (hereinafter referred to as **Model**). Most of us know a lot about models: from classic logistic regression and decision trees to deep learning, we focus on how to develop high-quality models, and we may pay less attention to feature engineering.
However, as the saying goes, data and features determine the upper limit of machine learning, while models and algorithms only approach this limit. It can be seen that we have long agreed on the importance of Feature Engineering.
@@ -59,7 +59,7 @@ For example, the following user transaction table (hereinafter referred as data
| trans_type | STRING | Transaction Type |
| province | STRING | Province |
| city | STRING | City |
| label | BOOL | Sample label, true\|false |
| label | BOOL | Sample label, `true` or `false` |

In addition to the primary table, there may also be tables storing relevant auxiliary information in the database, which can be combined with the primary table through the JOIN operation. These tables are called **Secondary Tables** (note that there may be multiple secondary tables). For example, we can have a secondary table storing the merchants' history flow. In the process of feature engineering, more valuable information can be obtained by combining the primary and secondary tables. The feature engineering over multiple tables will be introduced in detail in the [next part](tutorial_sql_2.md) of this series.

@@ -143,39 +143,40 @@ Important parameters include:
- The lower bound time must be `>=` the upper bound time.
- The lower bound row must follow the upper bound row.

For more features, please refer to the [documentation](../openmldb_sql/dql/WHERE_CLAUSE.md).

#### Example

For the transaction table T1 shown above, we define two `ROWS_RANGE` windows and two `ROWS` windows. The windows of each row are grouped by user ID (`uid`) and sorted by transaction time (`trans_time`). The following figure shows the result of grouping and sorting.

![img](images/table_t1.png)

Note that the following window definitions are not completed SQL. We will add aggregate functions later to complete runnable SQL.
Note that the following window definitions are not complete SQL statements. We will add aggregate functions to make them runnable. (See [3.3.2](332-step-2constructfeaturesbasedontimewindow))

- w1d: the window within the most recent day
**w1d: the window within the most recent day**
The window covers the user's most recent day: it contains the rows from one day before the current row up to and including the current row.
```sql
window w1d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
```

The `w1d` window shown in the above figure is for the partition `id=9`, and the `w1d` window contains three rows (`id=6`, `id=8`, `id=9`). These three rows fall in the time window [2022-02-07 12:00:00, 2022-02-08 12:00:00] .

- w1d_10d: the window from 1 day ago to the last 10 days
**w1d_10d: the window from 10 days ago to 1 day ago**
```sql
window w1d_10d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 10d PRECEDING AND 1d PRECEDING)
```

The window `w1d_10d` for the partition `id=9` contains three rows, which are `id=1`, `id=3` and `id=4`. These three rows fall in the time window [2022-01-29 12:00:00, 2022-02-07 12:00:00].
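
The two `ROWS_RANGE` windows above can be emulated outside the database to see which rows they select. The following is a minimal Python sketch, not OpenMLDB code: the helper `rows_range_window`, the row ids, and the timestamps are all hypothetical and do not reproduce the contents of table t1. It assumes both window bounds are inclusive, matching the closed intervals quoted above.

```python
from datetime import datetime, timedelta

# Hypothetical rows of ONE partition (a single uid), sorted by trans_time.
# The ids and timestamps are illustrative only; they are not the rows of t1.
rows = [
    (1, datetime(2022, 2, 1, 9, 0)),
    (2, datetime(2022, 2, 6, 18, 0)),
    (3, datetime(2022, 2, 7, 15, 0)),
    (4, datetime(2022, 2, 8, 12, 0)),  # the current row
]


def rows_range_window(rows, current_ts, start_preceding, end_preceding):
    """Emulate ROWS_RANGE BETWEEN <start> PRECEDING AND <end> PRECEDING.

    Returns the ids of rows whose timestamp lies in the closed interval
    [current_ts - start_preceding, current_ts - end_preceding].
    """
    lower = current_ts - start_preceding
    upper = current_ts - end_preceding
    return [row_id for row_id, ts in rows if lower <= ts <= upper]


current_ts = rows[-1][1]

# w1d: BETWEEN 1d PRECEDING AND CURRENT ROW
w1d = rows_range_window(rows, current_ts, timedelta(days=1), timedelta(0))

# w1d_10d: BETWEEN 10d PRECEDING AND 1d PRECEDING
w1d_10d = rows_range_window(rows, current_ts, timedelta(days=10), timedelta(days=1))

print(w1d)      # ids inside the most recent day
print(w1d_10d)  # ids between 10 days ago and 1 day ago
```

With this sample data, `w1d` selects the rows timestamped within the last 24 hours of the current row, and `w1d_10d` selects the older rows, so the two windows do not overlap except at the exact one-day boundary.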

- w0_1: the window contains the last 0 ~ 1 rows
**w0_1: the window contains the last 0 ~ 1 rows**
The window contains the last 0 ~ 1 rows, including the previous line and the current line.
```sql
window w0_1 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
```

The window `w0_1` for the partition `id=10` contains 2 rows, which are `id=7` and `id=10`.
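
A `ROWS` window counts rows instead of measuring time. The sketch below emulates that in Python under stated assumptions: `rows_window` is a hypothetical helper, the ids are made up, and the list is assumed to hold one partition already sorted by `trans_time`. The same helper also covers a frame such as `ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING`.

```python
def rows_window(ids, current_index, start_preceding, end_preceding):
    """Emulate ROWS BETWEEN <start> PRECEDING AND <end> PRECEDING.

    `ids` holds one partition sorted by the ORDER BY column. The result
    spans from `start_preceding` rows before the current row to
    `end_preceding` rows before it, both inclusive. A frame that reaches
    past the start of the partition is simply cut short.
    """
    first = max(0, current_index - start_preceding)
    last = current_index - end_preceding
    if last < 0:
        return []  # the whole frame lies before the first row
    return ids[first:last + 1]


# Hypothetical row ids of one user, oldest first; the last one is the current row.
ids = [11, 12, 13, 14, 15]
current_index = len(ids) - 1

# ROWS BETWEEN 1 PRECEDING AND CURRENT ROW: previous row plus current row.
w0_1 = rows_window(ids, current_index, 1, 0)

# ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING: skips the current and previous row.
w2_10 = rows_window(ids, current_index, 10, 2)

print(w0_1)
print(w2_10)
```

Unlike the time-based windows, the number of rows here is bounded by the frame size, regardless of how far apart the transactions are in time.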

- w2_10: the window contains the last 2 ~ 10 rows
**w2_10: the window contains the last 2 ~ 10 rows**

```sql
window w2_10 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING)
@@ -304,7 +305,7 @@ window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PREC

We make frequency statistics for a given column as we may need to know the type of the highest frequency, the proportion of the type with the largest number, etc., in each category.

`top1_ratio`: Find out the type with the largest number and compute the proportion of its number in the window.
**`top1_ratio`**: Find out the type with the largest number and compute the proportion of its number in the window.

The following SQL uses `top1_ratio` to find out the city with the most transactions in the last 30 days and compute the proportion of the number of transactions of the city to the total number of transactions in t1.
```sql
@@ -314,7 +315,7 @@ FROM t1
window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PRECEDING AND CURRENT ROW);
```

`topn_frequency(col, top_n)`: Find the `top_n` categories with the highest frequency in the window
**`topn_frequency(col, top_n)`**: Find the `top_n` categories with the highest frequency in the window

The following SQL uses `topn_frequency` to find out the top 2 cities with the highest number of transactions in the last 30 days in t1.
```sql
8 changes: 4 additions & 4 deletions docs/en/tutorial/tutorial_sql_2.md
@@ -63,7 +63,7 @@ As shown below, left table `LAST JOIN` right table with `ORDER BY` and right tab

## 3. Multi-Row Aggregation over Multiple Tables

For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../reference/sql/dql/WINDOW_CLAUSE.md#window-union) syntax.
For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) syntax.
WINDOW UNION supports combining multiple pieces of data from the secondary table to form a window on secondary table.
Based on the time window, it is convenient to construct the multi-row aggregation feature of the secondary table.
Similarly, two steps need to be completed to construct the multi-row aggregation feature of the secondary table:
@@ -122,10 +122,10 @@ Among them, necessary elements include:
- Lower bound time must be >= upper bound time
- The row number of lower bound must be <= the row number of upper bound
- `INSTANCE_NOT_IN_WINDOW`: It indicates that except for the current row, other data in the main table will not enter the window.
- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../reference/sql/dql/WINDOW_CLAUSE.md).
- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../openmldb_sql/sql/dql/WINDOW_CLAUSE.md).
```

### Example
#### Example

Let's see the usage of WINDOW UNION through specific examples.

@@ -166,7 +166,7 @@ PARTITION BY mid ORDER BY purchase_time
ROWS_RANGE BETWEEN 10d PRECEDING AND 1 PRECEDING INSTANCE_NOT_IN_WINDOW)
```

## 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
### 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table

Apply the multi-row aggregation function on the created window to construct aggregation features on multi-rows of secondary table, so that the number of rows finally generated is the same as that of the main table.
For example, we can construct features from the secondary table like: the total retail sales of merchants in the last 10 days `w10d_merchant_purchase_amt_sum` and the total consumption times of the merchant in the last 10 days `w10d_merchant_purchase_count`.
3 changes: 2 additions & 1 deletion docs/zh/tutorial/tutorial_sql_1.md
@@ -144,6 +144,7 @@ window window_name as (PARTITION BY partition_col ORDER BY order_col ROWS_RANGE
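- In OpenMLDB, the lower bound row count must be <= the upper bound row count

For more syntax and features, see the [OpenMLDB Window Reference Manual](../openmldb_sql/dql/WHERE_CLAUSE.md)

#### Example
For the transaction table t1 shown above, we define two time windows and two row-count windows. The window of each sample row is grouped by user ID (`uid`) and sorted by transaction time (`trans_time`). The following figure shows the data after grouping and sorting.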
![img](images/table_t1.jpg)
@@ -240,7 +241,7 @@ xxx_cate(col, cate) over w
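- Parameter `col`: the column to aggregate.
- Parameter `cate`: the grouping column.

The currently supported aggregate functions with the _cate suffix are: `count_cate`, `sum_cate`, `avg_cate`, `max_cate`, `min_cate`
The currently supported aggregate functions with the `_cate` suffix are: `count_cate`, `sum_cate`, `avg_cate`, `max_cate`, `min_cate`

Related examples are as follows: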

2 changes: 1 addition & 1 deletion docs/zh/tutorial/tutorial_sql_2.md
@@ -64,7 +64,7 @@ SELECT * FROM s1 LAST JOIN s2 ORDER BY s2.std_ts ON s1.col1 = s2.col1;

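## 3. Multi-Row Aggregation Features over Secondary Tables

For secondary-table joining scenarios, OpenMLDB extends the standard WINDOW syntax and adds the [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#windowunion) feature, which supports joining multiple rows from the secondary table to form a secondary-table window. On top of this window, multi-row aggregation features of the secondary table can be constructed conveniently. As before, constructing them takes two steps:
For secondary-table joining scenarios, OpenMLDB extends the standard WINDOW syntax and adds the [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) feature, which supports joining multiple rows from the secondary table to form a secondary-table window. On top of this window, multi-row aggregation features of the secondary table can be constructed conveniently. As before, constructing them takes two steps:

- Step 1: Define the secondary-table union window.
- Step 2: Construct multi-row aggregation features on the secondary-table union window.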