suggest add syntax select from generate_series()
#10069
Comments
I think we could add this as a user defined table function in datafusion-cli quite easily. For example, we could follow the model of:

SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.

It is documented here: https://arrow.apache.org/datafusion/user-guide/cli.html
The code for it is here: https://github.com/apache/arrow-datafusion/blob/637293580db0634a4efbd3f52e4700992ee3080d/datafusion-cli/src/functions.rs#L215-L442
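To make the model concrete, here is a minimal sketch of what such a function could look like, assuming the TableFunctionImpl trait and MemTable used by the linked datafusion-cli code. Module paths and the shape of Expr::Literal have moved between DataFusion versions, so treat this as illustrative; the GenerateSeriesFunc name and the literal-only argument handling are my own:

use std::sync::Arc;

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::function::TableFunctionImpl;
use datafusion::datasource::{MemTable, TableProvider};
use datafusion::error::{DataFusionError, Result};
use datafusion::logical_expr::Expr;
use datafusion::scalar::ScalarValue;

/// Hypothetical table function: generate_series(start, stop) over Int64 literals.
struct GenerateSeriesFunc;

impl TableFunctionImpl for GenerateSeriesFunc {
    fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
        // This sketch only accepts exactly two Int64 literal arguments.
        if args.len() != 2 {
            return Err(DataFusionError::Plan(
                "generate_series expects two arguments".to_string(),
            ));
        }
        let bound = |arg: &Expr| match arg {
            Expr::Literal(ScalarValue::Int64(Some(v))) => Ok(*v),
            _ => Err(DataFusionError::Plan(
                "generate_series expects Int64 literals".to_string(),
            )),
        };
        let (start, stop) = (bound(&args[0])?, bound(&args[1])?);

        // Materialize the whole series into a single in-memory batch and
        // hand it back as a MemTable, which already implements TableProvider.
        let schema = Arc::new(Schema::new(vec![Field::new(
            "value",
            DataType::Int64,
            false,
        )]));
        let array = Int64Array::from_iter_values(start..=stop);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;
        let table = MemTable::try_new(schema, vec![vec![batch]])?;
        Ok(Arc::new(table))
    }
}

A real implementation would also want to accept column expressions and an optional step argument, and could stream batches lazily instead of materializing the whole series up front.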
related: #9354
I think switching generate_series to a table function would affect queries like mine. Example: I need to spread the data equally over equal intervals between 2 data points. I use generate_series to generate an array per row. The way that I do this now is as follows:

create table manual(time timestamp, index_value DOUBLE);
insert into
manual (time, index_value)
values
('2022-01-31 23:00:00.0000000', 935035.000),
('2022-02-28 23:00:00.0000000', 954469.000),
('2022-03-31 22:00:00.0000000', 905204.000),
('2022-04-30 22:00:00.0000000', 971415.000);

with
index_values AS (
select
time AS to_time,
lag(time, 1) over (order by time) AS from_time,
index_value AS to_index,
lag(index_value, 1) over (order by time) AS from_index
from manual
),
series AS (
select
unnest(generate_series(from_time, to_time, interval '1 month')) AS time,
(case
when to_index < from_index THEN (999999 - from_index) + to_index
else to_index - from_index end)
/ coalesce(nullif(array_length(generate_series(from_time, to_time, interval '1 month')), 0), 1) AS delta_value
from index_values
) SELECT * FROM series;

Specifically, rewritten with a table-valued generate_series in the FROM clause, it would become:

with
index_values AS (
select
time AS to_time,
lag(time, 1) over (order by time) AS from_time,
index_value AS to_index,
lag(index_value, 1) over (order by time) AS from_index
from manual
),
series AS (
select
time,
(case
when to_index < from_index THEN (999999 - from_index) + to_index
else to_index - from_index end)
/ coalesce(nullif(count(time) over (partition by to_time, from_time), 0), 1) AS delta_value
from index_values, generate_series(from_time, to_time, interval '900 seconds') AS time
) SELECT * FROM series;

(If you have other/better suggestions for how data can be spread/interpolated in raw DF, all ears :) )

TL;DR: Just mentioning this here in case anyone is using generate_series this way.
FWIW I think we should consider keeping the existing array-returning generate_series function as well.
That might work, but I wonder how it will be inferred and whether it will cause any conflicts. Some examples that can be ambiguous, off the top of my head:

select generate_series(1,3), * from generate_series(10, 13)

This can either follow the PostgreSQL behavior (both calls are set-returning, so the select-list series is expanded against each row produced by the FROM clause), or:

+------------------------------------+----------------------------------------------+
| generate_series(Int64(1),Int64(3)) | UNNEST(generate_series(Int64(10),Int64(13))) |
+------------------------------------+----------------------------------------------+
| [1, 2, 3] | 10 |
| [1, 2, 3] | 11 |
| [1, 2, 3] | 12 |
| [1, 2, 3] | 13 |
+------------------------------------+----------------------------------------------+
4 row(s) fetched.
Elapsed 0.001 seconds.

Also, I wonder what the results of something like this would look like 🤔:

select generate_series(1, 3), array_length(generate_series(1, 3))
Maybe we can use the 2nd interpretation (the same as DuckDB); currently, DataFusion's generate_series returns an array.
@alamb Is it possible to implement this UDTF inside the core datafusion crate, rather than only in datafusion-cli?
Makes sense to me -- we may need to create a new crate for such table functions first.
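Wherever the implementation ends up living, embedders can already wire such a function in through the existing SessionContext::register_udtf API. This sketch reuses the hypothetical GenerateSeriesFunc from above and assumes a tokio runtime:

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register the hypothetical table function from the earlier sketch.
    ctx.register_udtf("generate_series", Arc::new(GenerateSeriesFunc));

    // The requested syntax then works like a scan over any other table.
    let df = ctx.sql("SELECT * FROM generate_series(1, 5)").await?;
    df.show().await?;
    Ok(())
}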
BTW you can get a similar effect with unnest:

> select unnest(generate_series(1, 10));
+---------------------------------------------+
| UNNEST(generate_series(Int64(1),Int64(10))) |
+---------------------------------------------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+---------------------------------------------+
10 row(s) fetched.
Elapsed 0.020 seconds.
Is your feature request related to a problem or challenge?
It is complex and slow to use unnest().
Describe the solution you'd like
Support running

select from generate_series()

as DuckDB does.

Describe alternatives you've considered
No response
Additional context