Merge branch 'main' of github.com:datafuselabs/databend into improve_hash_join_take
Dousir9 committed Sep 22, 2023
2 parents 491df68 + 14819b3 commit e385a09
Showing 98 changed files with 2,353 additions and 692 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock


2 changes: 1 addition & 1 deletion docs/doc/13-sql-reference/99-ansi-sql.md
@@ -95,7 +95,7 @@ Databend aims to conform to the SQL standard, with particular support for ISO/IE
| E121-17 | WITH HOLD cursors | <span class="text-red">No</span> | |
| **E131** | **Null value support (nulls in lieu of values)** | <span class="text-blue">Yes</span> | |
| **E141** | **Basic integrity constraints** | <span class="text-red">No</span> | |
| E141-01 | NOT NULL constraints | <span class="text-blue">Yes</span> | Default in Databend: All columns are non-nullable (NOT NULL). |
| E141-01 | NOT NULL constraints | <span class="text-blue">Yes</span> | Default in Databend: All columns are nullable. |
| E141-02 | UNIQUE constraint of NOT NULL columns | <span class="text-red">No</span> | |
| E141-03 | PRIMARY KEY constraints | <span class="text-red">No</span> | |
| E141-04 | Basic FOREIGN KEY constraint with the NO ACTION default for both referential delete action and referential update action | <span class="text-red">No</span> | |
6 changes: 1 addition & 5 deletions docs/doc/14-sql-commands/00-ddl/50-udf/_category_.json
@@ -1,7 +1,3 @@
{
"label": "User-Defined Function",
"link": {
"type": "generated-index",
"slug": "/sql-commands/ddl/udf"
}
"label": "User-Defined Function"
}
27 changes: 19 additions & 8 deletions docs/doc/14-sql-commands/00-ddl/50-udf/ddl-alter-function.md
@@ -3,22 +3,33 @@ title: ALTER FUNCTION
description:
Modifies the properties for an existing user-defined function.
---
import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.116"/>

Alters a user-defined function.

## Syntax

```sql
CREATE FUNCTION <name> AS ([ argname ]) -> '<function_definition>'
-- Alter UDF created with lambda expression
ALTER FUNCTION [IF NOT EXISTS] <function_name>
AS (<input_param_names>) -> <lambda_expression>
[DESC='<description>']

-- Alter UDF created with UDF server
ALTER FUNCTION [IF NOT EXISTS] <function_name>
AS (<input_param_types>) RETURNS <return_type> LANGUAGE <language_name>
HANDLER = '<handler_name>' ADDRESS = '<udf_server_address>'
[DESC='<description>']
```

## Examples

```sql
CREATE FUNCTION a_plus_3 AS (a) -> a+3+3;
ALTER FUNCTION a_plus_3 AS (a) -> a+3;

SELECT a_plus_3(2);
+---------+
| (2 + 3) |
+---------+
| 5 |
+---------+
```
CREATE FUNCTION gcd (INT, INT) RETURNS INT LANGUAGE python HANDLER = 'gcd' ADDRESS = 'http://0.0.0.0:8815';
ALTER FUNCTION gcd (INT, INT) RETURNS INT LANGUAGE python HANDLER = 'gcd_new' ADDRESS = 'http://0.0.0.0:8815';
```
118 changes: 114 additions & 4 deletions docs/doc/14-sql-commands/00-ddl/50-udf/ddl-create-function.md
@@ -3,20 +3,44 @@ title: CREATE FUNCTION
description:
Create a new user-defined scalar function.
---
import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.116"/>

## CREATE FUNCTION

Creates a new UDF (user-defined function), the UDF can contain an SQL expression.
Creates a user-defined function.

## Syntax

```sql
CREATE FUNCTION [ IF NOT EXISTS ] <name> AS ([ argname ]) -> '<function_definition>'
-- Create with lambda expression
CREATE FUNCTION [IF NOT EXISTS] <function_name>
AS (<input_param_names>) -> <lambda_expression>
[DESC='<description>']


-- Create with UDF server
CREATE FUNCTION [IF NOT EXISTS] <function_name>
AS (<input_param_types>) RETURNS <return_type> LANGUAGE <language_name>
HANDLER = '<handler_name>' ADDRESS = '<udf_server_address>'
[DESC='<description>']
```

| Parameter | Description |
|-----------------------|---------------------------------------------------------------------------------------------------|
| `<function_name>` | The name of the function. |
| `<lambda_expression>` | The lambda expression or code snippet defining the function's behavior. |
| `DESC='<description>'` | Description of the UDF.|
| `<input_param_names>` | A comma-separated list of input parameter names. |
| `<input_param_types>` | A comma-separated list of input parameter types. |
| `<return_type>` | The return type of the function. |
| `LANGUAGE` | Specifies the language used to write the function. Available values: `python`. |
| `HANDLER = '<handler_name>'` | Specifies the name of the function's handler. |
| `ADDRESS = '<udf_server_address>'` | Specifies the address of the UDF server. |

## Examples

### Creating UDF with Lambda Expression

```sql
CREATE FUNCTION a_plus_3 AS (a) -> a+3;

@@ -53,3 +77,89 @@ DROP FUNCTION get_v2;

DROP TABLE json_table;
```

### Creating UDF with UDF Server (Python)

This example demonstrates how to enable and configure a UDF server in Python:

1. Enable UDF server support by adding the following parameters to the [query] section in the [databend-query.toml](https://github.com/datafuselabs/databend/blob/main/scripts/distribution/configs/databend-query.toml) configuration file.

```toml title='databend-query.toml'
[query]
...
enable_udf_server = true
# List the allowed UDF server addresses, separating multiple addresses with commas.
# For example, ['http://0.0.0.0:8815', 'http://example.com']
udf_server_allow_list = ['http://0.0.0.0:8815']
...
```

2. Define your function. This code defines and runs a UDF server in Python, which exposes a custom function *gcd* for calculating the greatest common divisor of two integers and allows remote execution of this function:

:::note
The SDK package is not yet available. Prior to its release, please download the 'udf.py' file from https://github.com/datafuselabs/databend/blob/main/tests/udf-server/udf.py and ensure it is saved in the same directory as this Python script. This step is essential for the code to function correctly.
:::

```python title='udf_server.py'
from udf import *

@udf(
    input_types=["INT", "INT"],
    result_type="INT",
    skip_null=True,
)
def gcd(x: int, y: int) -> int:
    while y != 0:
        (x, y) = (y, x % y)
    return x

if __name__ == '__main__':
    # create a UDF server listening at '0.0.0.0:8815'
    server = UdfServer("0.0.0.0:8815")
    # add defined functions
    server.add_function(gcd)
    # start the UDF server
    server.serve()
```

`@udf` is a decorator used for defining UDFs in Databend, supporting the following parameters:

| Parameter | Description |
|--------------|-----------------------------------------------------------------------------------------------------|
| input_types | A list of strings or Arrow data types that specify the input data types. |
| result_type | A string or an Arrow data type that specifies the return value type. |
| name | An optional string specifying the function name. If not provided, the original name will be used. |
| io_threads | Number of I/O threads used per data chunk for I/O bound functions. |
| skip_null | A boolean value specifying whether to skip NULL values. If set to True, NULL values will not be passed to the function, and the corresponding return value is set to NULL. Default is False. |

This table maps Databend data types to their Python equivalents:

| Databend Type | Python Type |
|-----------------------|-----------------------|
| BOOLEAN | bool |
| TINYINT (UNSIGNED) | int |
| SMALLINT (UNSIGNED) | int |
| INT (UNSIGNED) | int |
| BIGINT (UNSIGNED) | int |
| FLOAT | float |
| DOUBLE | float |
| DECIMAL | decimal.Decimal |
| DATE | datetime.date |
| TIMESTAMP | datetime.datetime |
| VARCHAR | str |
| VARIANT | any |
| MAP(K,V) | dict |
| ARRAY(T) | list[T] |
| TUPLE(T...) | tuple(T...) |

3. Run the Python file to start the UDF server:

```shell
python3 udf_server.py
```

4. Register the function *gcd* in Databend with [CREATE FUNCTION](ddl-create-function.md):

```sql
CREATE FUNCTION gcd (INT, INT) RETURNS INT LANGUAGE python HANDLER = 'gcd' ADDRESS = 'http://0.0.0.0:8815'
```
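
Before registering the handler, it can help to check the function body on its own. The following is a minimal sketch that repeats the Euclidean loop from `udf_server.py` as a plain Python function, without the `@udf` decorator or `UdfServer`, so it does not need `udf.py`. The name `gcd_plain` is hypothetical and is not part of Databend.

```python
def gcd_plain(x: int, y: int) -> int:
    """Euclidean algorithm, the same loop as the gcd handler above."""
    while y != 0:
        # Replace (x, y) with (y, x mod y) until the remainder is zero.
        x, y = y, x % y
    return x


if __name__ == "__main__":
    print(gcd_plain(48, 18))  # 6
    print(gcd_plain(17, 5))   # 1
```

Once the UDF is registered, `SELECT gcd(48, 18);` should return the same value as the local check.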
6 changes: 3 additions & 3 deletions docs/doc/14-sql-commands/00-ddl/50-udf/ddl-drop-function.md
@@ -4,12 +4,12 @@ description:
Drop an existing user-defined function.
---

Drop an existing user-defined function.
Drops a user-defined function.

## Syntax

```sql
DROP FUNCTION [IF EXISTS] <name>
DROP FUNCTION [IF EXISTS] <function_name>
```

## Examples
@@ -19,4 +19,4 @@ DROP FUNCTION a_plus_3;

SELECT a_plus_3(2);
ERROR 1105 (HY000): Code: 2602, Text = Unknown Function a_plus_3 (while in analyze select projection).
```
```
125 changes: 125 additions & 0 deletions docs/doc/14-sql-commands/00-ddl/50-udf/index.md
@@ -0,0 +1,125 @@
---
title: User-Defined Function
---
import IndexOverviewList from '@site/src/components/IndexOverviewList';

## What are UDFs?

User-Defined Functions (UDFs) enable you to define your own custom operations to process data within Databend. They are typically written as lambda expressions or implemented via a UDF server in a programming language such as Python, and they are executed as part of Databend's query processing pipeline. Advantages of using UDFs include:

- Customized Data Transformations: UDFs empower you to perform data transformations that may not be achievable through built-in Databend functions alone. This customization is particularly valuable for handling unique data formats or business logic.

- Performance Optimization: UDFs provide the flexibility to define and fine-tune your own custom functions, enabling you to optimize data processing to meet precise performance requirements. This means you can tailor the code for maximum efficiency, ensuring that your data processing tasks run as efficiently as possible.

- Code Reusability: UDFs can be reused across multiple queries, saving time and effort in coding and maintaining data processing logic.

## Managing UDFs

To manage UDFs in Databend, use the following commands:

<IndexOverviewList />

## Usage Examples

This section demonstrates two UDF implementation methods within Databend: one by creating UDFs with lambda expressions and the other by utilizing UDF servers in conjunction with Python. For additional examples of defining UDFs in various programming languages, see [CREATE FUNCTION](ddl-create-function.md).

### UDF Implementation with Lambda Expression

This example implements a UDF named *a_plus_3* using a lambda expression:

```sql
CREATE FUNCTION a_plus_3 AS (a) -> a+3;

SELECT a_plus_3(2);
+---------+
| (2 + 3) |
+---------+
| 5 |
+---------+
```

### UDF Implementation via UDF Server

This example demonstrates how to enable and configure a UDF server in Python:

1. Enable UDF server support by adding the following parameters to the [query] section in the [databend-query.toml](https://github.com/datafuselabs/databend/blob/main/scripts/distribution/configs/databend-query.toml) configuration file.

```toml title='databend-query.toml'
[query]
...
enable_udf_server = true
# List the allowed UDF server addresses, separating multiple addresses with commas.
# For example, ['http://0.0.0.0:8815', 'http://example.com']
udf_server_allow_list = ['http://0.0.0.0:8815']
...
```

2. Define your function. This code defines and runs a UDF server in Python, which exposes a custom function *gcd* for calculating the greatest common divisor of two integers and allows remote execution of this function:

:::note
The SDK package is not yet available. Prior to its release, please download the 'udf.py' file from https://github.com/datafuselabs/databend/blob/main/tests/udf-server/udf.py and ensure it is saved in the same directory as this Python script. This step is essential for the code to function correctly.
:::

```python title='udf_server.py'
from udf import *

@udf(
    input_types=["INT", "INT"],
    result_type="INT",
    skip_null=True,
)
def gcd(x: int, y: int) -> int:
    while y != 0:
        (x, y) = (y, x % y)
    return x

if __name__ == '__main__':
    # create a UDF server listening at '0.0.0.0:8815'
    server = UdfServer("0.0.0.0:8815")
    # add defined functions
    server.add_function(gcd)
    # start the UDF server
    server.serve()
```

`@udf` is a decorator used for defining UDFs in Databend, supporting the following parameters:

| Parameter | Description |
|--------------|-----------------------------------------------------------------------------------------------------|
| input_types | A list of strings or Arrow data types that specify the input data types. |
| result_type | A string or an Arrow data type that specifies the return value type. |
| name | An optional string specifying the function name. If not provided, the original name will be used. |
| io_threads | Number of I/O threads used per data chunk for I/O bound functions. |
| skip_null | A boolean value specifying whether to skip NULL values. If set to True, NULL values will not be passed to the function, and the corresponding return value is set to NULL. Default is False. |

This table maps Databend data types to their Python equivalents:

| Databend Type | Python Type |
|-----------------------|-----------------------|
| BOOLEAN | bool |
| TINYINT (UNSIGNED) | int |
| SMALLINT (UNSIGNED) | int |
| INT (UNSIGNED) | int |
| BIGINT (UNSIGNED) | int |
| FLOAT | float |
| DOUBLE | float |
| DECIMAL | decimal.Decimal |
| DATE | datetime.date |
| TIMESTAMP | datetime.datetime |
| VARCHAR | str |
| VARIANT | any |
| MAP(K,V) | dict |
| ARRAY(T) | list[T] |
| TUPLE(T...) | tuple(T...) |

3. Run the Python file to start the UDF server:

```shell
python3 udf_server.py
```

4. Register the function *gcd* in Databend with [CREATE FUNCTION](ddl-create-function.md):

```sql
CREATE FUNCTION gcd (INT, INT) RETURNS INT LANGUAGE python HANDLER = 'gcd' ADDRESS = 'http://0.0.0.0:8815'
```
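
The `skip_null` behavior described in the `@udf` parameter table can be pictured with a small wrapper. This is only an illustrative sketch of the documented semantics (NULL in, NULL out, without calling the function); it is not the implementation in `udf.py`, and the names `skip_null_sketch` and `gcd_sketch` are hypothetical. Python's `None` stands in for SQL NULL.

```python
from functools import wraps


def skip_null_sketch(func):
    """Hypothetical wrapper: return None when any argument is None."""
    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            # A NULL input yields a NULL result; func is never called.
            return None
        return func(*args)
    return wrapper


@skip_null_sketch
def gcd_sketch(x, y):
    while y != 0:
        x, y = y, x % y
    return x


if __name__ == "__main__":
    print(gcd_sketch(12, 8))    # 4
    print(gcd_sketch(None, 8))  # None
```

With `skip_null=False` (the documented default), the function body would have to handle `None` arguments itself.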
6 changes: 5 additions & 1 deletion docs/doc/14-sql-commands/10-dml/dml-copy-into-table.md
@@ -184,10 +184,14 @@ externalLocation ::=

Specifies a list of one or more file names (separated by commas) to be loaded.

### PATTERN = 'regex_pattern'
### PATTERN = '<regex_pattern>'

A [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern string, enclosed in single quotes, specifying the file names to match. Click [here](#loading-data-with-pattern-matching) to see an example. For PCRE2 syntax, see http://www.pcre.org/current/doc/html/pcre2syntax.html.

:::note
For a file `@<stage_name>/<path>/<sub_path>` to be included, `<sub_path>` must match `^<regex_pattern>$`. In other words, the pattern must match the entire sub-path, not just part of it.
:::
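
The effect of the implicit `^...$` anchoring can be sketched in Python. Note the assumptions: Python's `re` module is not PCRE2, but for simple patterns like the ones below the two behave the same, and `re.fullmatch` is used here to imitate the anchoring. The helper name `is_selected` and the file names are hypothetical.

```python
import re


def is_selected(sub_path: str, regex_pattern: str) -> bool:
    """Return True if the whole sub_path matches the pattern."""
    # fullmatch succeeds only if the entire string matches,
    # which mirrors wrapping the pattern in ^...$.
    return re.fullmatch(regex_pattern, sub_path) is not None


if __name__ == "__main__":
    # '.log' matches exactly four characters, so 'app.log' is not selected.
    print(is_selected("app.log", ".log"))            # False
    # '.*[.]log' allows any prefix before the '.log' suffix.
    print(is_selected("app.log", ".*[.]log"))        # True
    print(is_selected("2023/app.log", ".*[.]log"))   # True
```

This is why a pattern intended to select every `.log` file is written as `.*[.]log` rather than `.log`.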

### FILE_FORMAT

See [Input & Output File Formats](../../13-sql-reference/50-file-format-options.md).
7 changes: 6 additions & 1 deletion docs/doc/15-sql-functions/112-table-functions/list_stage.md
@@ -36,10 +36,15 @@ externalStage ::= @<external_stage_name>[/<path>]
userStage ::= @~[/<path>]
```

### PATTERN

See [COPY INTO table](/14-sql-commands/10-dml/dml-copy-into-table.md).


## Examples

```sql
SELECT * FROM list_stage(location => '@my_stage/', pattern => '.log');
SELECT * FROM list_stage(location => '@my_stage/', pattern => '.*[.]log');
+----------------+------+------------------------------------+-------------------------------+---------+
| name | size | md5 | last_modified | creator |
+----------------+------+------------------------------------+-------------------------------+---------+