feat(c/driver/postgresql): Enable basic connect/query workflow for Redshift #2219
@lidavidm Any ideas on a good approach here? We could eliminate the database option and just fall back on the non-array version of the type query (or issue two queries). It would still need something like
It seems Redshift is juuust different enough that it's not quite Postgres anymore but not different enough to warrant a separate codebase. Is there any way to tell from libpq that we're dealing with Redshift and just automatically disable COPY and change the type query?
Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether? I believe tools like AWS SDK for pandas use a COPY from Parquet to achieve high throughput to Redshift, so there might be some precedent to still go that route
846221b
to
148f44d
Compare
I agree that it's on the knife edge!
It looks like
If we could do text we'd need a completely different parser (and in a funny way, even the "use copy = false" branch is using the binary COPY format, it's just accessing it through the
We could probably exploit that if we used a Go or Rust based driver! Quick demo:
library(adbcdrivermanager)
db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = Sys.getenv("ADBC_REDSHIFT_TEST_URI")
)
con <- db |>
  adbc_connection_init()
con |>
  read_adbc("SELECT 'foofy'") |>
  tibble::as_tibble()
#> # A tibble: 1 × 1
#> `?column?`
#> <chr>
#> 1 foofy
con |>
adbc_connection_get_info() |>
tibble::as_tibble()
#> # A tibble: 6 × 2
#> info_name info_value$string_value $bool_value $int64_value $int32_bitmask
#> <dbl> <chr> <lgl> <dbl> <int>
#> 1 0 Redshift NA NA NA
#> 2 1 1.0.77467 NA NA NA
#> 3 100 ADBC PostgreSQL Driver NA NA NA
#> 4 101 (unknown) NA NA NA
#> 5 102 0.6.0 NA NA NA
#> 6 103 <NA> NA 1001000 NA
#> # ℹ 2 more variables: info_value$string_list <list<chr>>,
#> # $int32_to_int32_list_map <list<df[,2]>>
Created on 2024-10-29 with reprex v2.1.1
It seems it's just a missing header on a few platforms?
minor comments / suggestions
// While there are remaining version components and we haven't reached the end of the
// string
while (component_begin < version.size() && component < out.size()) {
Maybe a good use case for str::find here?
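A minimal sketch of what a `find`-based version of that loop might look like, assuming "str::find" refers to `std::string::find`. The function name `ParseVersionComponents` is hypothetical and not taken from the PR; only the loop condition mirrors the snippet under review.

```cpp
#include <array>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the PR): parse up to three dot-separated
// components of a version string. Components that are missing stay 0.
std::array<int, 3> ParseVersionComponents(const std::string& version) {
  std::array<int, 3> out{};
  std::size_t component_begin = 0;
  std::size_t component = 0;

  // While there are remaining version components and we haven't reached the
  // end of the string
  while (component_begin < version.size() && component < out.size()) {
    // Locate the next separator; treat "not found" as end of string
    std::size_t component_end = version.find('.', component_begin);
    if (component_end == std::string::npos) {
      component_end = version.size();
    }

    // std::atoi stops at the first non-digit and returns 0 for empty input
    out[component] = std::atoi(
        version.substr(component_begin, component_end - component_begin).c_str());

    component_begin = component_end + 1;
    ++component;
  }

  return out;
}
```

Called with the version string shown in the demo output, `ParseVersionComponents("1.0.77467")` fills all three components; a two-component string such as `"16.2"` leaves the last one at zero.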
Co-authored-by: William Ayd <[email protected]>
c/driver/postgresql/connection.cc
Outdated
if (RedshiftVersion()[0] > 0) {
  infos.emplace_back(info_codes[i], "Redshift");
} else {
  infos.push_back({info_codes[i], "PostgreSQL"});
I should have been clearer but all of the push_back's here I think are better with emplace_back
I don't mind either way, but most advice I read tends to suggest only using emplace back in specific cases (e.g., https://abseil.io/tips/112 ).
C++...what a language.
Well in this case either is likely fine. I am under the impression that emplace_back would avoid any calls to the move constructor of the list element, along with any move constructors that need to be called when the vector is resized. In this particular case it probably doesn't make a difference; maybe something to just look at when performance is more critical
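To make the first part of that impression concrete, here is a small sketch that counts move constructions for each call. `InfoItem` is a hypothetical stand-in for the driver's element type, not the real one, and the vector is reserved up front so no reallocation happens. On reallocation, existing elements are moved (or copied) whichever of the two calls is used, so this only illustrates the temporary that `push_back({...})` creates.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the element type (not the driver's actual type).
// It counts how many times its move constructor runs. Requires C++17 for
// the inline static member.
struct InfoItem {
  static inline int moves = 0;

  int code;
  std::string value;

  InfoItem(int code, std::string value) : code(code), value(std::move(value)) {}
  InfoItem(const InfoItem&) = default;
  InfoItem(InfoItem&& other) noexcept
      : code(other.code), value(std::move(other.value)) {
    ++moves;
  }
};

// Number of InfoItem move constructions caused by one push_back({...})
int CountMovesPushBack() {
  std::vector<InfoItem> infos;
  infos.reserve(1);  // rule out reallocation
  InfoItem::moves = 0;
  infos.push_back({0, "Redshift"});  // builds a temporary, then moves it in
  return InfoItem::moves;
}

// Number of InfoItem move constructions caused by one emplace_back(...)
int CountMovesEmplaceBack() {
  std::vector<InfoItem> infos;
  infos.reserve(1);  // rule out reallocation
  InfoItem::moves = 0;
  infos.emplace_back(0, "Redshift");  // constructs the element in place
  return InfoItem::moves;
}
```

For an element this small the single extra move is unlikely to matter, which is consistent with the "either is likely fine" conclusion in this thread.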
I couldn't find any existing emplace_back() usage so I changed these back. We can always reevaluate!
  return ADBC_STATUS_INTERNAL;
}
const char* server_version_num = (*it)[0].data;
infos.push_back({info_codes[i], server_version_num});
Here's another spot
c/driver/postgresql/postgres_type.h
Outdated
kUserDefined,
// This is not an actual type, but there are cases where all we have is an Oid
// that was not inserted into the type resolver. We can't use "unknown" or "opaque"
// or "void" because those names show up in actual pg_type tables.
kUnnamed
This surfaced because apparently the "geometry" type is returned with an oid that doesn't exist (3999) despite actually existing (with oid 3000). There's really no reason we can't still return the binary data that was sent there with the appropriate arrow.opaque metadata, which is what this particular hack enables.
kArrowOpaque perhaps to be explicit?
ArrowErrorCode SetSchema(ArrowSchema* schema,
                         const std::string& vendor_name = "PostgreSQL") const {
This lets our "opaque" type have the appropriate vendor name (since it's not always "PostgreSQL" any more).
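As a rough illustration of why the vendor name has to be threaded through: the `arrow.opaque` canonical extension type serializes its metadata as a JSON object with `type_name` and `vendor_name` keys. The helpers below (`EscapeJson`, `OpaqueMetadata`) are hypothetical and are not the driver's implementation; they only show how a defaulted `vendor_name` parameter could end up in that metadata string.

```cpp
#include <string>

// Hypothetical helper: escape only quotes and backslashes. This assumes
// type and vendor names contain no control characters.
std::string EscapeJson(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    if (c == '"' || c == '\\') {
      out.push_back('\\');
    }
    out.push_back(c);
  }
  return out;
}

// Hypothetical helper: build the arrow.opaque extension metadata for a
// backend type, defaulting the vendor the same way SetSchema() does above.
std::string OpaqueMetadata(const std::string& type_name,
                           const std::string& vendor_name = "PostgreSQL") {
  return "{\"type_name\": \"" + EscapeJson(type_name) +
         "\", \"vendor_name\": \"" + EscapeJson(vendor_name) + "\"}";
}
```

With this shape, a caller connected to Redshift would pass `"Redshift"` explicitly, while existing PostgreSQL call sites keep working without changes because of the default argument.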
// Allow Redshift to execute this query without constraints
// TODO(paleolimbot): Investigate to see if we can simplify the constraints query so that
// it works on both!
void SetEnableConstraints(bool enable_constraints) {
  enable_constraints_ = enable_constraints;
}
I didn't dig too deeply here but I did check that we get column names! I am not sure that we're getting tables from schemas outside "public" (there are quite a few things that look like sample database schemas but I don't see any tables in them listed by our query).
If we did want to seriously support redshift we would need to do this somehow (right now bulk insert doesn't work, and the workaround of
Ah...Redshift has entirely separate SDKs? In that case maybe a different driver would be better long term...
It's definitely a better long-term plan since all of the performance optimizations have to be disabled for this to work (I only discovered the SDKs in the process of reading the documentation to implement the things here 🙂 ). I'm neutral on whether this PR is too big of a hack, although some of the things here are good ideas anyway (e.g., updating the type resolver population to use the helpers/status, not failing when an OID isn't recognized).
I think we can see if there's enough popularity for Redshift that someone wants to develop/contribute a dedicated driver
Can we call this out in the docs, though? (Redshift works, but none of the optimizations will work)
std::array<int, 3> postgres_server_version_{};
std::array<int, 3> redshift_server_version_{};
I suppose if you wanted to be fancy you could use std::variant with single-member structs but this works fine
I'm not sure I have a clear vision about how that would look but if we need to update the internals to be cleaner later on we can!
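For reference, one possible reading of the `std::variant` suggestion is sketched below. All names (`PostgresVersion`, `RedshiftVersion`, `ServerVersion`, `VendorName`, `IsRedshift`) are hypothetical. Note that a variant holds exactly one alternative, so unlike the two arrays in the PR this sketch would not keep a PostgreSQL version and a Redshift version side by side.

```cpp
#include <array>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical single-member structs that tag which vendor a version
// belongs to (names are illustrative, not from the PR).
struct PostgresVersion {
  std::array<int, 3> components{};
};
struct RedshiftVersion {
  std::array<int, 3> components{};
};

using ServerVersion = std::variant<PostgresVersion, RedshiftVersion>;

// Vendor name derived from whichever alternative the variant holds
std::string VendorName(const ServerVersion& version) {
  return std::visit(
      [](const auto& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, RedshiftVersion>) {
          return "Redshift";
        } else {
          return "PostgreSQL";
        }
      },
      version);
}

// Replaces checks of the form `RedshiftVersion()[0] > 0`
bool IsRedshift(const ServerVersion& version) {
  return std::holds_alternative<RedshiftVersion>(version);
}
```

The trade-off is that vendor checks become explicit in the type rather than encoded as "first component is nonzero", at the cost of extra types and `std::visit` boilerplate.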
// This is not an actual type, but there are cases where all we have is an Oid
// that was not inserted into the type resolver. We can't use "unknown" or "opaque"
// or "void" because those names show up in actual pg_type tables.
kUnnamedArrowOpaque
Hmm, that seems like it should be an error? (or if we had warnings, a warning that you should recreate the connection to reload the OIDs)
Eventually we'll probably have to tackle it as part of #1755
A warning would be helpful when we get there...for now I think it's a little nicer as arrow.opaque because then you can inspect it in context and see a bit of the data to maybe understand what went wrong. For the case that led me to add this, even reconnecting or lazy updating wouldn't have helped (the backend pg_type table is just missing a type definition for GEOMETRY).
Just following up on #1563 to see if the missing typarray column is the only issue. Getting the details right might be a large project, but we might be able to support a basic connection without too much effort. Parameter binding and non-COPY result fetching seem to work. However, the default query fetch method (COPY) is not supported, connection_get_info() fails, and, at a glance, connection_get_objects() might be returning incorrect results (and fails at the column depth).
Created on 2024-10-04 with reprex v2.1.1