
suggest add syntax select from generate_series() #10069

Open
l1t1 opened this issue Apr 13, 2024 · 9 comments · May be fixed by #13540
Labels
enhancement New feature or request

Comments

@l1t1

l1t1 commented Apr 13, 2024

Is your feature request related to a problem or challenge?

It is complex and slow to use unnest().

Describe the solution you'd like

It could run `select from generate_series()` as DuckDB does.

Describe alternatives you've considered

No response

Additional context

D:\duckdb>datafusion-cli
DataFusion CLI v37.0.0

> select sum(a) from unnest(generate_series(1,1000000)) as t(a);
+--------------+
| SUM(t.a)     |
+--------------+
| 500000500000 |
+--------------+
1 row(s) fetched.
Elapsed 3.648 seconds.
> select sum(a) from generate_series(1,1000000) as t(a);
Error during planning: table function 'generate_series' not found
>
D:\duckdb>duckdb011
v0.10.1 4a89d97db8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D .timer on
D select sum(a) from unnest(generate_series(1,1000000)) as t(a);
┌──────────────┐
│    sum(a)    │
│    int128    │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.854 user 0.859375 sys 0.078125
D select sum(a) from generate_series(1,1000000) as t(a);
┌──────────────┐
│    sum(a)    │
│    int128    │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.033 user 0.015625 sys 0.000000
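
As a quick cross-check of the value both engines return (a Python aside, not part of either CLI session): the sum of 1 through 1,000,000 matches the closed-form triangular-number formula.

```python
# Both engines report 500000500000 for sum(a) over generate_series(1, 1000000).
# Cross-check against the closed-form formula n * (n + 1) / 2.
n = 1_000_000
closed_form = n * (n + 1) // 2
brute_force = sum(range(1, n + 1))
print(closed_form)  # 500000500000
assert closed_form == brute_force
```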
		
@l1t1 l1t1 added the enhancement New feature or request label Apr 13, 2024
@alamb
Contributor

alamb commented Apr 13, 2024

I think we could add this as a user defined table function in datafusion-cli quite easily

For example we could follow the model of parquet_metadata (which also follows duckdb):

SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.

It is documented here: https://arrow.apache.org/datafusion/user-guide/cli.html

The code for it is here: https://github.com/apache/arrow-datafusion/blob/637293580db0634a4efbd3f52e4700992ee3080d/datafusion-cli/src/functions.rs#L215-L442
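
The idea behind such a user-defined table function can be sketched in Python (names like `register_udtf` are illustrative only, not DataFusion's actual Rust API, which lives in the functions.rs file linked above): a registry maps a function name to a callable that produces rows, which the planner can then use as a FROM-clause source.

```python
# Minimal sketch of the table-function idea, the way datafusion-cli wires up
# parquet_metadata. All names here are hypothetical, not DataFusion's API.
table_functions = {}

def register_udtf(name, func):
    table_functions[name] = func

def generate_series(start, stop):
    # Inclusive bounds, matching SQL's generate_series; one row per value.
    return [(v,) for v in range(start, stop + 1)]

register_udtf("generate_series", generate_series)

# Models: select sum(a) from generate_series(1, 1000000) as t(a)
rows = table_functions["generate_series"](1, 1_000_000)
print(sum(a for (a,) in rows))  # 500000500000
```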

@Lordworms
Contributor

related #9354

@Abdullahsab3
Contributor

Abdullahsab3 commented Sep 25, 2024

I think switching generate_series to a UDTF will break the API. In most cases it's an acceptable API change. However, there is (at least for me) one particular use case, the array generated from generate_series, that will probably need closer attention.

For example: I need to spread the data equally over equal intervals between two data points, and I use `generate_series` to generate an array per row. The way I do this now is as follows:
create table manual(time timestamp, index_value DOUBLE);
insert into
  manual (time, index_value)
values
  ('2022-01-31 23:00:00.0000000', 935035.000),
  ('2022-02-28 23:00:00.0000000', 954469.000),
  ('2022-03-31 22:00:00.0000000', 905204.000),
  ('2022-04-30 22:00:00.0000000', 971415.000);
with
    index_values AS (
    select
        time AS to_time,
        lag(time, 1) over (order by time) AS from_time,
        index_value AS to_index,
        lag(index_value, 1) over (order by time) AS from_index
    from manual
),
    series AS (
        select
            unnest(generate_series(from_time, to_time, interval '1 month')) AS time,
            (case
                when to_index < from_index THEN (999999 - from_index) + to_index
                else to_index - from_index end)
                / coalesce(nullif(array_length(generate_series(from_time, to_time, interval '1 month')), 0), 1) AS delta_value
        from index_values
    ) SELECT * FROM series;

Specifically, `array_length(generate_series(from_time, to_time, interval '1 month'))` will need to be done differently if generate_series becomes a UDTF. It will probably need to become `count(time) over (partition by to_time, from_time)`:

with
    index_values AS (
    select
        time AS to_time,
        lag(time, 1) over (order by time) AS from_time,
        index_value AS to_index,
        lag(index_value, 1) over (order by time) AS from_index
    from manual
),
    series AS (
        select
            time,
            (case
                when to_index < from_index THEN (999999 - from_index) + to_index
                else to_index - from_index end)
                / coalesce(nullif(count(time) over (partition by to_time, from_time), 0), 1) AS delta_value
        from index_values, generate_series(from_time, to_time, interval '900 seconds') AS time
    ) SELECT * FROM series;
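
A toy Python model of the rewritten query's arithmetic for a single pair of rows (using the first two rows of the `manual` table above; a sketch of the logic, not DataFusion execution): the number of generated timestamps plays the role of count(time) over (partition by to_time, from_time).

```python
from datetime import datetime, timedelta

# One row of index_values, built from the first two rows of `manual`.
from_time = datetime(2022, 1, 31, 23, 0, 0)
to_time = datetime(2022, 2, 28, 23, 0, 0)
from_index, to_index = 935035.0, 954469.0

# generate_series(from_time, to_time, interval '900 seconds') as a row source.
step = timedelta(seconds=900)
times = []
t = from_time
while t <= to_time:
    times.append(t)
    t += step

# count(time) over (partition by to_time, from_time) == len(times) here.
delta_value = (to_index - from_index) / (len(times) or 1)
print(len(times), round(delta_value, 3))
```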

(If you have other/better suggestions for how data can be spread/interpolated in raw DataFusion, I'm all ears :) )

TL;DR: if you're using generate_series to generate a series on a per-row basis, you may need to consider `OVER (PARTITION BY your_row_composite_key)` if this API change takes place.

Just mentioning this here in case anyone is using generate_series in a similar way as I do, and because I want to know if this ever gets added :p so I know I need to update my queries.

@alamb
Contributor

alamb commented Sep 25, 2024

FWIW I think we should consider keeping the existing generate_series function and add a new user defined table function with the same name

@Abdullahsab3
Contributor

Abdullahsab3 commented Sep 25, 2024

That might work, but I wonder how it will be inferred and whether it will cause any conflicts. Some examples that could be ambiguous, off the top of my head:

select generate_series(1,3), * from generate_series(10, 13)

This can be either (postgresql behavior):

 generate_series | generate_series
-----------------+-----------------
               1 |              10
               2 |              10
               3 |              10
               1 |              11
               2 |              11
               3 |              11
               1 |              12
               2 |              12
               3 |              12
               1 |              13
               2 |              13
               3 |              13

or

+------------------------------------+----------------------------------------------+
| generate_series(Int64(1),Int64(3)) | UNNEST(generate_series(Int64(10),Int64(13))) |
+------------------------------------+----------------------------------------------+
| [1, 2, 3]                          | 10                                           |
| [1, 2, 3]                          | 11                                           |
| [1, 2, 3]                          | 12                                           |
| [1, 2, 3]                          | 13                                           |
+------------------------------------+----------------------------------------------+
4 row(s) fetched.
Elapsed 0.001 seconds.

Also, I wonder what the results of something like this would look like 🤔:

select generate_series(1, 3), array_length(generate_series(1, 3))
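
The two readings can be modeled in plain Python (a sketch, not engine code): the PostgreSQL reading cross-joins the two set-returning calls, while the DuckDB reading keeps the select-list call as a scalar returning an array.

```python
from itertools import product

# Models: select generate_series(1,3), * from generate_series(10, 13)
series_a = list(range(1, 4))    # generate_series(1, 3)
series_b = list(range(10, 14))  # generate_series(10, 13)

# PostgreSQL reading: both calls are set-returning, so the rows cross-join.
pg_rows = [(a, b) for b, a in product(series_b, series_a)]

# DuckDB reading: the select-list call stays a scalar returning an array;
# only the FROM-clause call produces rows.
duck_rows = [(series_a, b) for b in series_b]

print(len(pg_rows), len(duck_rows))  # 12 4
```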

@2010YOUY01
Contributor

> That might work, but I wonder how it will be inferred and whether it will cause any conflicts. [...]
>
> select generate_series(1,3), * from generate_series(10, 13)

Maybe we can use the 2nd interpretation (the same as DuckDB): DataFusion's generate_series expression already has different semantics from Postgres's anyway.

@2010YOUY01
Contributor

> I think we could add this as a user defined table function in datafusion-cli quite easily

@alamb Is it possible to implement this UDTF inside the datafusion crate? 🤔 I find this table function can make end-to-end tests for spilling-related queries much easier.

@alamb
Contributor

alamb commented Oct 21, 2024

> > I think we could add this as a user defined table function in datafusion-cli quite easily
>
> @alamb Is it possible to implement this UDTF inside the datafusion crate? 🤔 I find this table function can make end-to-end tests for spilling related queries much easier

Makes sense to me -- we may need to create a datafusion-functions-table sub-crate as a place to put it, however 🤔

@alamb
Contributor

alamb commented Oct 21, 2024

BTW you can get a similar effect with unnest:

> select unnest(generate_series(1, 10));
+---------------------------------------------+
| UNNEST(generate_series(Int64(1),Int64(10))) |
+---------------------------------------------+
| 1                                           |
| 2                                           |
| 3                                           |
| 4                                           |
| 5                                           |
| 6                                           |
| 7                                           |
| 8                                           |
| 9                                           |
| 10                                          |
+---------------------------------------------+
10 row(s) fetched.
Elapsed 0.020 seconds.
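
For comparison, a tiny Python analogue of that unnest flattening (illustrative only):

```python
# generate_series(1, 10) yields an array; unnest emits one row per element.
array = list(range(1, 11))
rows = [(v,) for v in array]
print(len(rows))  # 10
```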
