Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(c/driver/postgresql): Enable basic connect/query workflow for Redshift #2219

Merged
merged 22 commits into from
Nov 5, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Oct 5, 2024

Just following up on #1563 to see if the missing typarray column is the only issue. To get the details right might be a large project, but we might be able to support a basic connection without too much effort. Paramter binding and non-COPY result fetching seem to work...the default query fetch method (COPY) is not supported, connection_get_info() fails, and at a glance, connection_get_objects() might be returning incorrect results (and fails at the column depth).

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = Sys.getenv("ADBC_REDSHIFT_TEST_URI"),
  adbc.postgresql.load_array_types = FALSE
)

con <- db |> 
  adbc_connection_init()

stmt <- con |> 
  adbc_statement_init(adbc.postgresql.use_copy = FALSE)

stream <- nanoarrow::nanoarrow_allocate_array_stream()
stmt |> 
  adbc_statement_bind(data.frame(45)) |> 
  adbc_statement_set_sql_query("SELECT 1 + $1 as foofy, 'string' as foofy_string") |> 
  adbc_statement_execute_query(stream)
#> [1] -1

tibble::as_tibble(stream)
#> # A tibble: 1 × 2
#>   foofy foofy_string
#>   <dbl> <chr>       
#> 1    46 string

Created on 2024-10-04 with reprex v2.1.1

@paleolimbot
Copy link
Member Author

@lidavidm Any ideas on a good approach here? We could eliminate the database option and just fall back on the non-array version of the type query (or issue two queries). It would still need something like cursor.adbc_statement.set_options(use_copy = False) but at least it would let Redshift users connect (and give us their bug reports about other things that don't work).

@lidavidm
Copy link
Member

It seems Redshift is juuust different enough that it's not quite Postgres anymore but not different enough to warrant a separate codebase. Is there any way to tell from libpq that we're dealing with Redshift and just automatically disable COPY and change the type query?

@WillAyd
Copy link
Contributor

WillAyd commented Oct 30, 2024

Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether?

I'm believe tools like AWS SDK for pandas use a COPY from Parquet to achieve high throughput to Redshift, so there might be some precedent to still go that route

@paleolimbot
Copy link
Member Author

It seems Redshift is juuust different enough that it's not quite Postgres anymore but not different enough to warrant a separate codebase.

I agree that it's on the knife edge!

Is there any way to tell from libpq that we're dealing with Redshift and just automatically disable COPY and change the type query?

It looks like SELECT version() does the trick here, with some parsing. I implemented this and pushed it...I'm not sure if it's too much of a hack? (I see there's some failing tests, so maybe)

Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether?

If we could to text we'd need a completely different parser (and in a funny way, even the "use copy = false" branch is using the binary COPY format, it's just accessing it through the PGresult instead of pulling the COPY using PQgetcopy().

I'm believe tools like AWS SDK for pandas use a COPY from Parquet to achieve high throughput to Redshift, so there might be some precedent to still go that route

We could probably exploit that if we used a Go or Rust based driver!

Quick demo:

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = Sys.getenv("ADBC_REDSHIFT_TEST_URI")
)

con <- db |> 
  adbc_connection_init()

con |> 
  read_adbc("SELECT 'foofy'") |> 
  tibble::as_tibble()
#> # A tibble: 1 × 1
#>   `?column?`
#>   <chr>     
#> 1 foofy

con |> 
  adbc_connection_get_info() |> 
  tibble::as_tibble()
#> # A tibble: 6 × 2
#>   info_name info_value$string_value $bool_value $int64_value $int32_bitmask
#>       <dbl> <chr>                   <lgl>              <dbl>          <int>
#> 1         0 Redshift                NA                    NA             NA
#> 2         1 1.0.77467               NA                    NA             NA
#> 3       100 ADBC PostgreSQL Driver  NA                    NA             NA
#> 4       101 (unknown)               NA                    NA             NA
#> 5       102 0.6.0                   NA                    NA             NA
#> 6       103 <NA>                    NA               1001000             NA
#> # ℹ 2 more variables: info_value$string_list <list<chr>>,
#> #   $int32_to_int32_list_map <list<df[,2]>>

Created on 2024-10-29 with reprex v2.1.1

@lidavidm
Copy link
Member

It seems it's just a missing header on a few platforms?

Copy link
Contributor

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments / suggestions

c/driver/postgresql/connection.cc Outdated Show resolved Hide resolved
c/driver/postgresql/connection.cc Outdated Show resolved Hide resolved

// While there are remaining version components and we haven't reached the end of the
// string
while (component_begin < version.size() && component < out.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a good use case for str::find here?

@paleolimbot paleolimbot changed the title poc(c/driver/postgresql): Try to connect to redshift feat(c/driver/postgresql): Enable basic connect/query workflow for Redshift Oct 30, 2024
if (RedshiftVersion()[0] > 0) {
infos.emplace_back(info_codes[i], "Redshift");
} else {
infos.push_back({info_codes[i], "PostgreSQL"});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have been clearer but all of the push_back's here I think are better with emplace_back

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind either way, but most advice I read tends to suggest only using emplace back in specific cases (e.g., https://abseil.io/tips/112 ).

Copy link
Contributor

@WillAyd WillAyd Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++...what a language.

Well in this case either is likely fine. I am under the impression that emplace_back would avoid any calls to the move constructor of the list element, along with any move constructors that need to be called when the vector is resized. In this particular case it probably doesn't make a difference; maybe something to just look at when performance is more critical

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any existing emplace_back() usage so I changed these back. We can always reevaluate!

return ADBC_STATUS_INTERNAL;
}
const char* server_version_num = (*it)[0].data;
infos.push_back({info_codes[i], server_version_num});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another spot

Comment on lines 114 to 118
kUserDefined,
// This is not an actual type, but there are cases where all we have is an Oid
// that was not inserted into the type resolver. We can't use "unknown" or "opaque"
// or "void" because those names show up in actual pg_type tables.
kUnnamed
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This surfaced because apparently the "geometry" type is returned with an oid that doesn't exist (3999) despite actually existing (with oid 3000). There's really no reason we can't still return the binary data that was sent there with the appropriate arrow.opaque metadata, which is what this particular hack enables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kArrowOpaque perhaps to be explicit?

Comment on lines +216 to +217
ArrowErrorCode SetSchema(ArrowSchema* schema,
const std::string& vendor_name = "PostgreSQL") const {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets our "opaque" type have the appropriate vendor name (since it's not always "PostgreSQL" any more).

Comment on lines +179 to +184
// Allow Redshift to execute this query without constraints
// TODO(paleolimbot): Investigate to see if we can simplify the constraits query so that
// it works on both!
void SetEnableConstraints(bool enable_constraints) {
enable_constraints_ = enable_constraints;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't dig too deeply here but I did check that we get column names! I am not sure that we're getting tables from schemas outside "public" (there are quite a few things that look like sample database schemas but I don't see any tables in them listed by our query).

@paleolimbot
Copy link
Member Author

Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether?

If we did want to seriously support redshift we would need to do this somehow (right now bulk insert doesn't work, and the workaround of INSERT INTO with a bind stream is ungodly slow). There are both Rust and Go SDKs!

@lidavidm
Copy link
Member

Ah...Redshift has entirely separate SDKs? In that case maybe a different driver would be better long term...

@paleolimbot
Copy link
Member Author

Ah...Redshift has entirely separate SDKs? In that case maybe a different driver would be better long term...

It's definitely a better long-term plan since all of the performance optimizations have to be disabled for this to work (I only discovered the SDKs in the process of reading the documentation to implement the things here 🙂 ). I'm neutral on whether this PR is too big of a hack, although some of the things here are good ideas anyway (e.g., updating the type resolver population to use the helpers/status, not failing when an OID isn't recognized).

@paleolimbot paleolimbot marked this pull request as ready for review October 31, 2024 02:17
@github-actions github-actions bot added this to the ADBC Libraries 15 milestone Oct 31, 2024
@lidavidm
Copy link
Member

lidavidm commented Nov 5, 2024

I think we can see if there's enough popularity for Redshift that someone wants to develop/contribute a dedicated driver

@lidavidm
Copy link
Member

lidavidm commented Nov 5, 2024

Can we call this out in the docs, though? (Redshift works, but none of the optimizations will work)

Comment on lines +86 to +87
std::array<int, 3> postgres_server_version_{};
std::array<int, 3> redshift_server_version_{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose if you wanted to be fancy you could use std::variant with single-member structs but this works fine

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I have a clear vision about how that would look but if we need to update the internals to be cleaner later on we can!

Comment on lines +115 to +118
// This is not an actual type, but there are cases where all we have is an Oid
// that was not inserted into the type resolver. We can't use "unknown" or "opaque"
// or "void" because those names show up in actual pg_type tables.
kUnnamedArrowOpaque
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that seems like it should be an error? (or if we had warnings, a warning that you should recreate the connection to reload the OIDs)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually we'll probably have to tackle it as part of #1755

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A warning would be helpful when we get there...for now I think it's a little nicer as arrow.opaque because then you can inspect it in context and see a bit of the data to maybe understand what went wrong. For the case that led me to add this, even reconnecting or lazy updating wouldn't have helped (the backend pg_type table is just missing a type definition for GEOMETRY).

@paleolimbot paleolimbot merged commit 28b8c1b into apache:main Nov 5, 2024
63 checks passed
@paleolimbot paleolimbot deleted the c-driver-postgresql-redshift branch November 8, 2024 04:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants