feat: add query command to ceramic-one #545
Conversation
Is this going over HTTP or via direct access to the underlying data store (SQLite)? I believe it's the latter. If so, can it run at the same time as another C1 process running in daemon mode against the same data?
I feel like a query REPL would be best pointed against a running C1 daemon over the network using the Flight APIs, rather than by direct access to the local data files.
Good question, it's neither. As written today it cannot access the data. When I demoed it last week I pointed it at my S3 bucket and queried the data that way. To me that means we should add a bit of configuration for your S3 bucket as part of the query command so that it's ready to go. And in that configuration it will not compete with the SQLite file or any other local resource.
Oh interesting, so then this wouldn't really be querying data in a Ceramic node at all; it'd be querying data in Parquet that was exported out from Ceramic via the OLAP aggregator? TBH that feels like a weird thing to include in the Ceramic CLI, since at that point it kind of isn't really Ceramic data anymore.
As more of the Ceramic data comes to live in Parquet, this fits naturally. Additionally, we are leaning towards bringing the OLAP aggregator into C1, so that distinction won't be meaningful to end users.
I imagine that we're going to want a way for users to query the data that actually lives inside the Ceramic node, no? I would have assumed that that's what a query command would be for.
There are currently two tables in the DataFusion _pipeline_ architecture: conclusion_feed and doc_state. This change moves the DataFusion initialization code that interacts with those tables into a new ceramic-pipeline crate, where both the olap and query commands can reuse the logic. This new crate can be a home for future work in the _pipeline_ architecture as we flesh it out; for example, we will likely move the aggregator/ceramic_patch logic into it. In either case, the query command now knows how to query both tables when it first starts up. Table access is via Flight SQL and direct object store access.
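For context, a minimal sketch of what that shared setup might look like. The table names conclusion_feed and doc_state come from this PR, but the function, object paths, and error handling are illustrative assumptions, not the actual ceramic-pipeline code:

```rust
use std::sync::Arc;

use datafusion::error::DataFusionError;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use object_store::aws::AmazonS3Builder;
use url::Url;

/// Sketch only: register the S3 bucket and the two pipeline tables so
/// the query command can run SQL against them.
async fn session_for_bucket(bucket: &str) -> Result<SessionContext, DataFusionError> {
    let ctx = SessionContext::new();

    // Make s3://<bucket>/ URLs resolvable by DataFusion; from_env()
    // picks up the usual AWS_* credentials.
    let store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .build()?;
    let url = Url::parse(&format!("s3://{bucket}"))
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    ctx.register_object_store(&url, Arc::new(store));

    // Expose both pipeline tables as queryable Parquet directories;
    // the per-table paths here are assumptions.
    for table in ["conclusion_feed", "doc_state"] {
        ctx.register_parquet(
            table,
            &format!("s3://{bucket}/{table}/"),
            ParquetReadOptions::default(),
        )
        .await?;
    }
    Ok(ctx)
}
```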
Force-pushed from e952d56 to 6f98819
@stbrody the PR has been updated to auto-configure connecting to the flight_sql_endpoint.
I think we should have an HTTP API endpoint to run queries against your node, so that all the data in the SQLite file and any data exported to parquet files can be queried identically. When querying via C1 you should not need to understand the storage location, but obviously if you want to use DuckDB (bring your own query engine) then you will need to tell those tools where to find the state store.
lgtm mod questions about the argument names
one/src/query.rs
Outdated
    default_value = "http://127.0.0.1:5102",
    env = "CERAMIC_ONE_FLIGHT_SQL_ENDPOINT"
)]
flight_sql_endpoint: String,
I feel like the fact that it uses Flight as the API format is an implementation detail; the important thing is that this is a C1 node at the other end of this endpoint, right? Should this be called ceramic_one_endpoint or something like that?
Yes, however ceramic-one exposes multiple endpoints (p2p, REST, Flight SQL). Maybe we call it ceramic_one_query_endpoint, since its purpose is to expose query access to the data?
yep I like that
Using ceramic_one_query_endpoint meant the env var was CERAMIC_ONE_CERAMIC_ONE_QUERY_ENDPOINT, which is horrible. So I went with just query_endpoint, as the ceramic_one context is already present.
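For reference, a minimal sketch of the renamed argument using clap's derive API; the default value is carried over from the diff above, while the surrounding struct and doc comments are illustrative assumptions:

```rust
use clap::Parser;

/// Sketch only: how the renamed flag might look.
#[derive(Parser, Debug)]
struct QueryOpts {
    /// Endpoint of a running ceramic-one daemon to query against.
    #[arg(
        long,
        default_value = "http://127.0.0.1:5102",
        env = "CERAMIC_ONE_QUERY_ENDPOINT"
    )]
    query_endpoint: String,
}
```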
one/src/query.rs
Outdated
/// * AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS
///
#[arg(long, env = "CERAMIC_ONE_AWS_BUCKET")]
aws_bucket: String,
Kind of the inverse of my comment above: AWS is a specific platform, one we are notably not using. Should this instead be, say, s3_bucket or something like that? We're using the S3 API, but we aren't using the AWS platform...
I like s3_bucket; however, the env vars will still be AWS-prefixed, as that is the accepted standard for configuring S3 endpoint authentication.
sounds good
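To illustrate the naming split, a sketch assuming the object_store crate (which the DataFusion stack builds on); the function and flag wiring are assumptions, not the PR's code. The CLI flag can be named s3_bucket while credentials keep their conventional AWS_* names:

```rust
use object_store::aws::{AmazonS3, AmazonS3Builder};

/// Sketch only: build the S3 object store for the query command.
/// from_env() reads the conventional AWS_* variables, e.g.
/// AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT, and
/// AWS_ALLOW_HTTP, so authentication stays AWS-prefixed even though
/// the flag itself is s3_bucket.
fn build_store(s3_bucket: &str) -> object_store::Result<AmazonS3> {
    AmazonS3Builder::from_env()
        .with_bucket_name(s3_bucket)
        .build()
}
```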
Isn't that what the Flight SQL endpoint already is?
In my opinion it is. It's not an HTTP/1 endpoint, but trying to build an HTTP/1 endpoint to expose query access to the data would mean we basically reimplement Flight SQL.
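To make that concrete, a rough sketch of a client-side query over the Flight SQL endpoint, assuming a recent arrow-flight (with its Flight SQL client) plus tonic and futures; the endpoint matches the default from the diff above, and the query itself is illustrative:

```rust
use arrow_flight::sql::client::FlightSqlServiceClient;
use futures::TryStreamExt;
use tonic::transport::Endpoint;

/// Sketch only: run one query against a daemon's Flight SQL endpoint.
async fn query_daemon() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://127.0.0.1:5102")
        .connect()
        .await?;
    let mut client = FlightSqlServiceClient::new(channel);

    // execute() returns a FlightInfo describing where the result batches live.
    let info = client
        .execute("SELECT * FROM doc_state LIMIT 10".to_string(), None)
        .await?;

    // Redeem each endpoint's ticket and stream the Arrow record batches back.
    for endpoint in info.endpoint {
        if let Some(ticket) = endpoint.ticket {
            let mut stream = client.do_get(ticket).await?;
            while let Some(batch) = stream.try_next().await? {
                println!("{} rows", batch.num_rows());
            }
        }
    }
    Ok(())
}
```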