feat: merge main into v0.9 (#3969)
* feat: sbin use the generated zk conf (#3901)

Co-authored-by: lijiangnan <[email protected]>

* refactor!: relocate go sdk (#3889)

* refactor!: relocate go sdk

moving to https://github.com/4paradigm/openmldb-go-sdk

* go readme

* ci: fix sdk workflow

* docs: fix example (#3907)

raw SQL request mode example was wrong because execute_mode should be request

* fix: make clients always send auth info (#3906)

* fix: make clients use auth by default

* fix: let skip auth flag only affect verify

* feat: tablets get user table remotely (#3918)

* fix: make clients use auth by default

* fix: let skip auth flag only affect verify

* feat: tablets get user table remotely

* fix: use FLAGS_system_table_replica_num for user table

* fix: recoverdata support load disk table (#3888)

* docs: add map desc in create table (#3912)

* ci(#3904): python mac jobs fix (#3905)

* fix(#3909): checkout execute_mode in config clause in sql client (#3910)

* feat: merge dag sql (#3911)

* feat: merge AIOS DAG SQL

* feat: mergeDAGSQL

* add AIOSUtil

* feat: add AIOS merge SQL test case

* feat: split mergeDAGSQL and validateSQLInRequest

* fix: gcformat space and continuous sign (#3921)

* fix: gcformat space

* fix: gcformat continuous sign use hash

* fix: delete incorrect comments

* feat: merge 090 features to main (#3929)

* Set s3 and aws dependencies as provided (#3897)

* feat: exclude zookeeper for curator (#3899)

* Exclude zookeeper when using curator

* Fix local build java

* Run script to update post release version (#3931)

* feat: crud users synchronously (#3928)

* fix: make clients use auth by default

* fix: let skip auth flag only affect verify

* feat: tablets get user table remotely

* fix: use FLAGS_system_table_replica_num for user table

* feat: consistent user cruds

* fix: pass instance of tablet and nameserver into auth lambda to allow locking

* feat: best effort try to flush user data to all tablets

* fix: lock scope

* fix: stop user sync thread safely

* fix: default values for user table columns

* feat(parser): simple ANSI SQL rewriter (#3934)

* feat(parser): simple ANSI SQL rewriter

* feat(draft): translate request mode query

* feat: request query rewriter

* test: tpc rewrite cases

* feat(rewrite): enable ansi sql rewriter in `ExecuteSQL`

You may explicitly set this feature on via `set session ansi_sql_rewriter
= 'true'`

TODO: this rewriter feature should be off by default

* build(deps-dev): bump urllib3 from 1.26.18 to 1.26.19 in /docs (#3948)

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.18 to 1.26.19.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/1.26.19/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.18...1.26.19)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(udf): isin (#3939)

* feat(#3916): support @@execute_mode = 'request' (#3924)

* feat(udf): array_combine & array_join (#3945)

* feat(udf): array_combine

* feat(udf): new functions

- array_combine
- array_join

* feat: casting arrays to array<string> for array_combine

WIP, string allocation need fix

* fix: array_combine with non-string types

* feat(array_combine): handle null inputs

* fix(array_combine): behavior tweaks

- use empty string if delimiter is null
- restrict to array_combine(string, array<T> ...)

* feat: support batchrequest in ProcessQuery (#3938)

* feat: user authz (#3941)

* feat: change user table to match mysql

* feat: support user authz

* fix: clean up created users

* build(deps-dev): bump requests from 2.31.0 to 2.32.2 in /docs (#3951)

Bumps [requests](https://github.com/psf/requests) from 2.31.0 to 2.32.2.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.31.0...v2.32.2)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump org.apache.derby:derby (#3949)

Bumps org.apache.derby:derby from 10.14.2.0 to 10.17.1.0.

---
updated-dependencies:
- dependency-name: org.apache.derby:derby
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump org.postgresql:postgresql (#3950)

Bumps [org.postgresql:postgresql](https://github.com/pgjdbc/pgjdbc) from 42.3.3 to 42.3.9.
- [Release notes](https://github.com/pgjdbc/pgjdbc/releases)
- [Changelog](https://github.com/pgjdbc/pgjdbc/blob/master/CHANGELOG.md)
- [Commits](pgjdbc/pgjdbc@REL42.3.3...REL42.3.9)

---
updated-dependencies:
- dependency-name: org.postgresql:postgresql
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: iot table (#3944)

* feat: iot table

* fix

* fix

* fix delete key entry

* fix comment

* ut

* ut test

* fix ut

* sleep more for truncate

* sleep 16

* tool pytest fix and swig fix

* fix

* clean

* move to base

* fix

* fix coverage ut

* fix

---------

Co-authored-by: Huang Wei <[email protected]>

* feat(open-mysql-db): pandas support (#3868)

* feat(open-mysql-db): refactor

1. remove unnecessary instance var port
2. fix cause null bug
3. remove unnecessary throws
4. fix ctx.close() sequence bug
5. config sessionTimeout and requestTimeout
6. add docs of SqlEngine

* feat(open-mysql-db): refactor

* feat(open-mysql-db): revert password

* feat(open-mysql-db): mock commit and schema table count

* feat(open-mysql-db): replace data type text with string

* feat(open-mysql-db): remove null

---------

Co-authored-by: yangwucheng <[email protected]>

* fix: drop aggr tables in drop table (#3908)

* fix: drop aggr tables in drop table

* fix

* fix test

* fix

* fix

---------

Co-authored-by: Huang Wei <[email protected]>

* ci(#3954): fix checkout action on old glibc OS (#3955)

* ci(#3954): fix checkout action on old glibc OS

* ci: include checkout fix in all workflows

* ci: fix python-sdk

* test: node-2 to node-3 (#3957)

node-3 is not available, moving to node-2

* feat: support locate(substr, str[, pos]) function(#820) (#3943)

* fix(scripts): deploy spark correctly (#3958)

$SPARK_HOME may be a symbolic link referring to an invalid directory, so
we'd try 'rm -f' first

* Add changelog for 0.9.1 (#3959)

* fix: select from JOB_INFO should always in online mode (#3963)

* fix: select from JOB_INFO should always in online mode

Fix error when user set default `execute_mode` to offline:

```sql
set global execute_mode = 'offline';
select 1;
```

* fix: query mode on user & pre_agg tables

* build(docker): centos7 EOL (#3965)

* build(docker): centos7 EOL

* fix vault address for aarch64

* ci(docker): disable arm64 image

Don't have an ARM machine to test

* fix(docker): numpy version lock (#3966)

* Update docs version to 0.9.1 (#3960)

* add blog post (#3936)

* refactor: fix compile for mcjit and improve to tests (#3952)

* refactor: rm SQL_CASE_BASE_DIR

* fix: compile on mcjit

* feat: setup SqlCaseBaseDir for hybridse

TODO: also setup for tests in src/

* docs: add blog post (#3913)

* Include new posts

* update links

* minor change

* ci: update create-pull-request action to v6 in udf-doc-gen workflow & rm deprecated file sync (#3964)

* Updated create-pull-request action to v6 in udf-doc-gen workflow

* Removed references to docs/en/reference/sql/udfs_8h.md as the file no longer exists

* build: upgrade openmldb sdk version in self host (#3962)

* docs: add changelog for 0.9.2 (#3968)

* docs: update version 0.9.2 in docs (#3970)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: venessa <[email protected]>
Co-authored-by: lijiangnan <[email protected]>
Co-authored-by: aceforeverd <[email protected]>
Co-authored-by: oh2024 <[email protected]>
Co-authored-by: HuangWei <[email protected]>
Co-authored-by: wyl4pd <[email protected]>
Co-authored-by: tobe <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Huang Wei <[email protected]>
Co-authored-by: yangwucheng <[email protected]>
Co-authored-by: yangwucheng <[email protected]>
Co-authored-by: howd <[email protected]>
Co-authored-by: Siqi Wang <[email protected]>
Co-authored-by: Jayaprakash0511 <[email protected]>
15 people authored Jul 26, 2024
1 parent ba5e85b commit bfe5ba6
Showing 72 changed files with 535 additions and 170 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/hybridsql-docker.yml
@@ -93,6 +93,6 @@ jobs:
with:
context: docker
push: ${{ github.event_name == 'push' }}
-platforms: linux/amd64,linux/arm64
+platforms: linux/amd64
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
3 changes: 1 addition & 2 deletions .github/workflows/udf-doc.yml
@@ -50,11 +50,10 @@ jobs:
make -C hybridse/tools/documentation/udf_doxygen sync
- name: Create Pull Request
-uses: peter-evans/create-pull-request@v4
+uses: peter-evans/create-pull-request@v6
if: github.event_name != 'pull_request'
with:
add-paths: |
-docs/en/reference/sql/udfs_8h.md
docs/zh/openmldb_sql/udfs_8h.md
labels: |
udf
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# Changelog

## [0.9.2] - 2024-07-26

### Bug Fixes
- Fix upgrade openmldb sdk version in self host (#3962 @aceforeverd)
- Fix select from JOB_INFO should always in online mode (#3963 @aceforeverd)
- Fix update create-pull-request action to v6 in udf-doc-gen workflow & rm deprecated file sync (#3964 @Jayaprakash0511)
- Fix build in centos7 EOL (#3965 @aceforeverd)
- Fix numpy version lock (#3966 @aceforeverd)

## [0.9.1] - 2024-07-17

### Features
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -41,7 +41,7 @@ message (STATUS "CMAKE_PREFIX_PATH: ${CMAKE_PREFIX_PATH}")
message (STATUS "CMAKE_BUILD_TYPE: ${CMAKE_BUILD_TYPE}")
set(OPENMLDB_VERSION_MAJOR 0)
set(OPENMLDB_VERSION_MINOR 9)
-set(OPENMLDB_VERSION_BUG 0)
+set(OPENMLDB_VERSION_BUG 1)

function(get_commitid CODE_DIR COMMIT_ID)
find_package(Git REQUIRED)
2 changes: 1 addition & 1 deletion demo/Dockerfile
@@ -16,7 +16,7 @@ RUN apt-get update \
&& rm -rf /var/lib/apt/lists/*

RUN if [ -f "/additions/pypi.txt" ] ; then pip config set global.index-url $(cat /additions/pypi.txt) ; fi
-RUN pip install --no-cache-dir py4j==0.10.9 numpy lightgbm==3 tornado requests pandas==1.5 xgboost==1.4.2
+RUN pip install --no-cache-dir py4j==0.10.9 lightgbm==3 tornado requests pandas==1.5 xgboost==1.4.2 numpy==1.26.4

COPY init.sh /work/
COPY predict-taxi-trip-duration/script /work/taxi-trip/
2 changes: 1 addition & 1 deletion demo/java_quickstart/demo/pom.xml
@@ -29,7 +29,7 @@
<dependency>
<groupId>com.4paradigm.openmldb</groupId>
<artifactId>openmldb-jdbc</artifactId>
-<version>0.9.0</version>
+<version>0.9.2</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
4 changes: 2 additions & 2 deletions demo/predict-taxi-trip-duration/README.md
@@ -28,7 +28,7 @@ w2 as (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN
**Start docker**
```
-docker run -it 4pdosc/openmldb:0.9.0 bash
+docker run -it 4pdosc/openmldb:0.9.2 bash
```
**Initialize environment**
```bash
@@ -138,7 +138,7 @@ python3 predict.py
**Start docker**
```bash
-docker run -it 4pdosc/openmldb:0.9.0 bash
+docker run -it 4pdosc/openmldb:0.9.2 bash
```
**Initialize environment**
2 changes: 1 addition & 1 deletion demo/talkingdata-adtracking-fraud-detection/README.md
@@ -15,7 +15,7 @@ We recommend you to use docker to run the demo. OpenMLDB and dependencies have b
**Start docker**

```
-docker run -it 4pdosc/openmldb:0.9.0 bash
+docker run -it 4pdosc/openmldb:0.9.2 bash
```

#### Run locally
10 changes: 7 additions & 3 deletions docker/Dockerfile
@@ -21,9 +21,13 @@ ARG TARGETARCH

LABEL org.opencontainers.image.source https://github.com/4paradigm/OpenMLDB

-COPY setup_deps.sh /
+COPY ./*.sh /
# hadolint ignore=DL3031,DL3033
-RUN yum update -y && yum install -y centos-release-scl epel-release && \
+RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo && \
+sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo && \
+sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo && \
+yum update -y && yum install -y centos-release-scl epel-release && \
+/patch_yum_repo.sh && \
yum install -y devtoolset-8 rh-git227 devtoolset-8-libasan-devel flex doxygen java-1.8.0-openjdk-devel rh-python38-python-devel rh-python38-python-wheel rh-python38-python-requests rh-python38-python-pip && \
curl -Lo lcov-1.15-1.noarch.rpm https://github.com/linux-test-project/lcov/releases/download/v1.15/lcov-1.15-1.noarch.rpm && \
yum localinstall -y lcov-1.15-1.noarch.rpm && \
@@ -33,7 +37,7 @@ RUN yum update -y && yum install -y centos-release-scl epel-release && \
tar xzf zookeeper.tar.gz -C /deps/src && \
rm -v ./*.tar.gz && \
/setup_deps.sh -a "$TARGETARCH" -z "$ZETASQL_VERSION" -t "$THIRDPARTY_VERSION" && \
-rm -v /setup_deps.sh
+rm -v /*.sh

ENV THIRD_PARTY_DIR=/deps/usr
ENV THIRD_PARTY_SRC_DIR=/deps/src
11 changes: 11 additions & 0 deletions docker/patch_yum_repo.sh
@@ -0,0 +1,11 @@
#!/bin/bash

set -e

sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

if [[ "$ARCH" = "aarch64" ]]; then
sed -i s/vault.centos.org\\/centos/vault.centos.org\\/altarch/g /etc/yum.repos.d/*.repo
fi
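
To make the effect of these substitutions concrete, the following Python sketch applies the same four regular expressions to a sample repo stanza held in a string. The stanza text and the function name `patch_repo_text` are illustrative assumptions and are not part of the commit; the script itself edits the files under `/etc/yum.repos.d/` in place with `sed` and reads the architecture from `$ARCH` rather than from a function argument.

```python
import re

# Sample stanza resembling a stock CentOS 7 repo file (illustrative only).
SAMPLE_REPO = """\
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
gpgcheck=1
"""


def patch_repo_text(text: str, arch: str = "x86_64") -> str:
    """Apply the same substitutions as patch_yum_repo.sh to a string."""
    # Point the retired mirror host at the archive host.
    text = re.sub(r"mirror.centos.org", "vault.centos.org", text)
    # Uncomment baseurl lines.
    text = re.sub(r"^#.*baseurl=http", "baseurl=http", text, flags=re.MULTILINE)
    # Comment out mirrorlist lines.
    text = re.sub(r"^mirrorlist=http", "#mirrorlist=http", text, flags=re.MULTILINE)
    # aarch64 packages live under /altarch on the archive host.
    if arch == "aarch64":
        text = re.sub(r"vault.centos.org/centos", "vault.centos.org/altarch", text)
    return text


if __name__ == "__main__":
    print(patch_repo_text(SAMPLE_REPO))
    print(patch_repo_text(SAMPLE_REPO, arch="aarch64"))
```

After patching, the `baseurl` line is active and points at `vault.centos.org`, while the `mirrorlist` line is commented out, which is what lets `yum` keep working once the regular mirrors stop serving CentOS 7.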
2 changes: 1 addition & 1 deletion docs/en/blog_post/20240402_OpenmldbVsRedis.md
@@ -44,7 +44,7 @@ We plan to test with 1 million (referred to as 1M) keys, each corresponding to 1
Deployment can be done through containerization or directly on physical machines using software packages. There is no significant difference between the two methods. Below is an example of using containerization for deployment:

- OpenMLDB
-- Docker image: `docker pull 4pdosc/openmldb:0.9.0`
+- Docker image: `docker pull 4pdosc/openmldb:0.9.2`
- Documentation: [https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html](https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html)

- Redis:
56 changes: 56 additions & 0 deletions docs/en/blog_post/20240503_OpenmldbRelease.md
@@ -0,0 +1,56 @@
# OpenMLDB v0.9.0 Release: Major Upgrade in SQL Capabilities Covering the Entire Feature Servicing Process

OpenMLDB has just released a new version v0.9.0, including SQL syntax extensions, MySQL protocol compatibility, TiDB storage support, online feature computation, feature signatures, and more. Among these, the most noteworthy features are the MySQL protocol and ANSI SQL compatibility, along with the extended SQL syntax capabilities.

Firstly, MySQL protocol compatibility allows OpenMLDB users to access OpenMLDB clusters using any MySQL client, not limited to GUI applications like Navicat or Sequel Ace but also Java JDBC MySQL Driver, Python SQLAlchemy, Go MySQL Driver, and various programming language SDKs. For more information, you can refer to "[**Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and Multi-Language MySQL Client**](20240322_Openmysqldb.md)".

Secondly, the new version significantly expands SQL capabilities, in particular by implementing OpenMLDB’s unique request mode and stored procedure execution within standard SQL syntax. Compared to traditional SQL databases, OpenMLDB covers the entire machine learning process, including offline and online modes. In online mode, users can input sample data and get feature results through SQL feature extraction. Previously, this required deploying the SQL as a stored procedure with the `DEPLOY` command and then performing online feature computation through SDKs or HTTP interfaces. The new version adds `SELECT CONFIG` and `CALL` statements, allowing users to directly specify request mode and sample data in SQL to compute feature results, as shown below:

```
-- Execute an online request-mode query for the sample row (10, "foo", timestamp(4000))
SELECT id, count(val) over (partition by id order by ts rows between 10 preceding and current row)
FROM t1
CONFIG (execute_mode = 'request', values = (10, "foo", timestamp(4000)))
```
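
As a rough mental model of what such a query returns, the Python sketch below treats the sample row as if it were the newest row of its partition and evaluates the window over it. The column layout `(id, val, ts)` and the function `request_mode_count` are assumptions made for illustration; how OpenMLDB actually executes the query is not described in this post.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[str], int]  # assumed layout: (id, val, ts in milliseconds)


def request_mode_count(stored: List[Row], request: Row, preceding: int = 10) -> int:
    """Emulate count(val) OVER (PARTITION BY id ORDER BY ts
    ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW) for one request row."""
    req_id, _, req_ts = request
    # Keep stored rows from the same partition that are not newer than the request row.
    history = sorted(
        (row for row in stored if row[0] == req_id and row[2] <= req_ts),
        key=lambda row: row[2],
    )
    # The window is the last `preceding` stored rows plus the request row itself.
    window = history[max(0, len(history) - preceding):] + [request]
    # count(val) ignores NULLs, as in standard SQL.
    return sum(1 for _, val, _ in window if val is not None)


if __name__ == "__main__":
    table = [(10, "a", 1000), (10, None, 2000), (11, "b", 1500)]
    print(request_mode_count(table, (10, "foo", 4000)))  # prints 2
```

The request row is never written to the table in this model; it only participates in the window, which matches the idea of computing features for a sample without inserting it.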
You can also use the ANSI SQL `CALL` statement to invoke stored procedures with sample rows as parameters, as shown below:

```
-- Deploy the query as a stored procedure, then call it with the sample row (10, "foo", timestamp(4000))
DEPLOY window_features SELECT id, count(val) over (partition by id order by ts rows between 10 preceding and current row)
FROM t1;
CALL window_features(10, "foo", timestamp(4000))
```
For detailed release notes, please refer to: [https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0](https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0)

Please feel free to download and explore the latest release. Your feedback is highly valued and appreciated. We encourage you to share your thoughts and suggestions to help us improve and enhance the platform. Thank you for your support!

## Release Date

April 25, 2024

## Release Note

[https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0](https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0)

## Highlighted Features

* Added support for the latest version of SQLAlchemy 2, seamlessly integrating with popular Python frameworks such as pandas and NumPy.

* Expanded support for more data backends, integrating TiDB’s distributed file storage capability with OpenMLDB’s high-performance in-memory feature computation capability.

* Enhanced ANSI SQL support, fixed `first_value` semantics, supported `MAP` type and feature signatures, and added offline mode support for `INSERT` statements.

* Added support for MySQL protocol, allowing access to OpenMLDB clusters using MySQL clients like Navicat, Sequel Ace, and various MySQL SDKs for programming languages.

* Extended SQL syntax support, enabling online feature computation directly through `SELECT CONFIG` or `CALL` statements.

--------------------------------------------------------------------------------------------------------------

**For more information on OpenMLDB:**
* Official website: [https://openmldb.ai/](https://openmldb.ai/)
* GitHub: [https://github.com/4paradigm/OpenMLDB](https://github.com/4paradigm/OpenMLDB)
* Documentation: [https://openmldb.ai/docs/en/](https://openmldb.ai/docs/en/)
* Join us on [**Slack**](https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg)!

> _This post is a re-post from [OpenMLDB Blogs](https://openmldb.medium.com/)._
108 changes: 108 additions & 0 deletions docs/en/blog_post/20240523_OpenmldbFeatureSignatures.md
@@ -0,0 +1,108 @@
# Introducing OpenMLDB’s New Feature: Feature Signatures — Enabling Complete Feature Engineering with SQL

## Background

Rewinding to 2020, the Feature Engine team of Fourth Paradigm filed, and was later granted, an invention patent titled “[Data Processing Method, Device, Electronic Equipment, and Storage Medium Based on SQL](https://patents.google.com/patent/CN111752967A)”. This patent innovatively combines the SQL data processing language with machine learning feature signatures, greatly expanding the functional boundaries of SQL statements.

![Screenshot of Patent in Chinese](https://cdn-images-1.medium.com/max/2560/1*V5fQ3koN8HFikmZWJPtykA.png)

At that time, no SQL database or OLAP engine on the market supported this syntax, and even on Fourth Paradigm’s machine learning platform, the feature signature function could only be implemented using a custom DSL (Domain-Specific Language).

Finally, in version v0.9.0, OpenMLDB introduced the feature signature function, supporting sample output in formats such as CSV and LIBSVM. This allows direct integration with machine learning training or prediction while ensuring consistency between offline and online environments.

## Feature Signatures and Label Signatures

The feature signature function in OpenMLDB is implemented based on a series of OpenMLDB-customized UDFs (User-Defined Functions) on top of standard SQL. Currently, OpenMLDB supports the following signature functions:

* `continuous(column)`: Indicates that the column is a continuous feature; the column can be of any numerical type.

* `discrete(column[, bucket_size])`: Indicates that the column is a discrete feature; the column can be of boolean type, integer type, or date and time type. The optional parameter `bucket_size` sets the number of buckets. If `bucket_size` is not specified, the range of values is the entire range of the int64 type.

* `binary_label(column)`: Indicates that the column is a binary classification label; the column must be of boolean type.

* `multiclass_label(column)`: Indicates that the column is a multiclass classification label; the column can be of boolean type or integer type.

* `regression_label(column)`: Indicates that the column is a regression label; the column can be of any numerical type.

These functions must be used in conjunction with the sample format functions `csv` or `libsvm` and cannot be used independently. `csv` and `libsvm` can accept any number of parameters, and each parameter needs to be specified using functions like `continuous` to determine how to sign it. OpenMLDB handles null and erroneous data appropriately, retaining the maximum amount of sample information.
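
To illustrate what `bucket_size` does for `discrete`, here is a small Python sketch that maps a value to a 64-bit signature and optionally folds it into a fixed number of buckets. The hash (SHA-256 truncated to 64 bits) and the name `discrete_signature` are stand-ins chosen for this example; the post does not say which hash OpenMLDB uses, so the concrete numbers will differ from real output.

```python
import hashlib
from typing import Optional


def discrete_signature(value: object, bucket_size: Optional[int] = None) -> int:
    """Return an illustrative signature for a discrete feature value."""
    # Stand-in hash: first 8 bytes of SHA-256, read as a signed 64-bit integer.
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    raw = int.from_bytes(digest[:8], byteorder="big", signed=True)
    if bucket_size is None:
        # Without bucket_size the signature may fall anywhere in the int64 range.
        return raw
    # With bucket_size the signature is folded into [0, bucket_size).
    return raw % bucket_size


if __name__ == "__main__":
    print(discrete_signature(3))        # somewhere in the int64 range
    print(discrete_signature(3, 100))   # somewhere in 0..99
```

A bounded range is what makes it possible to reserve a fixed block of feature indices for a column, whereas an unbounded signature can only be kept apart from other columns probabilistically.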

## Usage Example

First, follow the [quick start](https://openmldb.ai/docs/en/main/tutorial/standalone_use.html) guide to get the image and start the OpenMLDB server and client.
```bash
docker run -it 4pdosc/openmldb:0.9.2 bash
/work/init.sh
/work/openmldb/sbin/openmldb-cli.sh
```

Create a database and import data in the OpenMLDB client.
```sql
--OpenMLDB CLI
CREATE DATABASE demo_db;
USE demo_db;
CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
SET @@execute_mode='offline';
LOAD DATA INFILE '/work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet' INTO TABLE t1 options(format='parquet', header=true, mode='append');
```

Use the `SHOW JOBS` command to check the task running status. After the task is successfully executed, perform feature engineering and export the training data in CSV format.

Currently, OpenMLDB does not support overly long column names, so the sample column must be given an explicit name, here `instance`, using `SELECT csv(...) AS instance`.

```sql
--OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
WITH t1 as (SELECT trip_duration,
passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW))
SELECT csv(
regression_label(trip_duration),
continuous(passenger_count),
continuous(vendor_sum_pl),
continuous(vendor_cnt),
discrete(vendor_cnt DIV 10)) AS instance
FROM t1 INTO OUTFILE '/tmp/feature_data_csv' OPTIONS(format='csv', header=false, quote='');
```

If LIBSVM format training data is needed, simply change `SELECT csv(...)` to `SELECT libsvm(...)`. Note that the `OPTIONS` should still use the CSV format because the exported data only has one column, which already contains the complete LIBSVM format sample.

Moreover, the `libsvm` function numbers continuous features, and discrete features with a known number of buckets, starting from 1. Specifying the number of buckets therefore ensures that the feature encoding ranges of different columns do not conflict. If the number of buckets for a discrete feature is not specified, there is a small probability that feature signatures conflict in some samples.

```sql
--OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
WITH t1 as (SELECT trip_duration,
passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW))
SELECT libsvm(
regression_label(trip_duration),
continuous(passenger_count),
continuous(vendor_sum_pl),
continuous(vendor_cnt),
discrete(vendor_cnt DIV 10, 100)) AS instance
FROM t1 INTO OUTFILE '/tmp/feature_data_libsvm' OPTIONS(format='csv', header=false, quote='');
```
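
The index numbering described above can be pictured with the following Python sketch, which builds one LIBSVM line (`label index:value ...`) from already-signed features. It reflects one reading of the text: each continuous feature takes a single index, and each discrete feature with a known `bucket_size` reserves that many consecutive indices. The function `libsvm_line`, the tuple encoding of features, and the choice to skip null values are assumptions for illustration, not OpenMLDB's documented behavior.

```python
from typing import List, Optional, Tuple, Union

Continuous = Tuple[str, Optional[float]]    # ("continuous", value)
Discrete = Tuple[str, Optional[int], int]   # ("discrete", signature, bucket_size)


def libsvm_line(label: float, features: List[Union[Continuous, Discrete]]) -> str:
    """Format one sample as a LIBSVM line with non-overlapping index ranges."""
    parts = [str(label)]
    next_index = 1  # LIBSVM feature indices start at 1
    for feature in features:
        if feature[0] == "continuous":
            value = feature[1]
            if value is not None:  # assumed: null features are simply omitted
                parts.append(f"{next_index}:{value}")
            next_index += 1  # one index per continuous feature
        else:
            signature, bucket_size = feature[1], feature[2]
            if signature is not None:
                parts.append(f"{next_index + signature % bucket_size}:1")
            next_index += bucket_size  # reserve bucket_size indices
    return " ".join(parts)


if __name__ == "__main__":
    sample = [("continuous", 1), ("continuous", 40.7), ("continuous", 3), ("discrete", 7, 100)]
    print(libsvm_line(455, sample))  # prints 455 1:1 2:40.7 3:3 11:1
```

Because the discrete feature here owns indices 4 through 103, any feature added after it would start at 104, so the ranges of different columns cannot overlap.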

## Summary

By combining SQL with machine learning, feature signatures simplify the data processing workflow, making feature engineering more efficient and consistent. This innovation extends the functional boundaries of SQL, supporting the output of various formats of data samples, directly connecting to machine learning training and prediction, improving data processing flexibility and accuracy, and having significant implications for data science and engineering practices.

OpenMLDB introduces signature functions to further bridge the gap between feature engineering and machine learning frameworks. By uniformly signing samples with OpenMLDB, offline and online consistency can be improved throughout the entire process, reducing maintenance and change costs. In the future, OpenMLDB will add more signature functions, including one-hot encoding and feature crossing, to make the information in sample feature data more easily utilized by machine learning frameworks.

--------------------------------------------------------------------------------------------------------------

**For more information on OpenMLDB:**
* Official website: [https://openmldb.ai/](https://openmldb.ai/)
* GitHub: [https://github.com/4paradigm/OpenMLDB](https://github.com/4paradigm/OpenMLDB)
* Documentation: [https://openmldb.ai/docs/en/](https://openmldb.ai/docs/en/)
* Join us on [**Slack**](https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg)!

> _This post is a re-post from [OpenMLDB Blogs](https://openmldb.medium.com/)._
7 changes: 6 additions & 1 deletion docs/en/blog_post/index.rst
@@ -11,4 +11,9 @@ OpenMLDB Blogs

Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and Multi-Language MySQL Client <20240322_Openmysqldb.md>

-Comparative Analysis of Memory Consumption: OpenMLDB vs Redis Test Report <20240402_OpenmldbVsRedis.md>
+Comparative Analysis of Memory Consumption: OpenMLDB vs Redis Test Report <20240402_OpenmldbVsRedis.md>
+
+OpenMLDB v0.9.0 Release: Major Upgrade in SQL Capabilities Covering the Entire Feature Servicing Process <20240503_OpenmldbRelease.md>
+
+Introducing OpenMLDB’s New Feature: Feature Signatures — Enabling Complete Feature Engineering with SQL <20240523_OpenmldbFeatureSignatures.md>
