
feat: add query command to ceramic-one #545

Merged: 5 commits from feat/query into main, Oct 16, 2024
Conversation

nathanielc (Collaborator)

No description provided.

@nathanielc nathanielc requested review from stbrody and a team as code owners September 30, 2024 22:51
@nathanielc nathanielc requested review from dav1do and removed request for a team September 30, 2024 22:51
stbrody (Collaborator) left a comment

Is this going over HTTP or via direct access to the underlying data store (SQLite)? I believe it's the latter; if so, can it run while another C1 process is running in daemon mode against the same data?

I feel like a query REPL would be best pointed at a running C1 daemon over the network using the Flight APIs, rather than at the local data files directly.

nathanielc (Collaborator, Author) commented Oct 8, 2024

> Is this going over HTTP or via direct access to the underlying data store (SQLite)?

Good question, it's neither. As written today it cannot access the data. When I demoed it last week I pointed it at my S3 bucket and queried the data that way.

To me that means we should add a bit of configuration for your S3 bucket as part of the query command so that it's ready to go. In that configuration it will not compete with the SQLite file or any other local resource.

I'll change this to a draft until I can get to that change. Either way, I'll wait to merge until it has a bit more utility than just a disconnected REPL.

stbrody (Collaborator) commented Oct 9, 2024

> To me that means we should add a bit of configuration for your S3 bucket as part of the query command so that it's ready to go

Oh interesting, so then this wouldn't really be querying data in a Ceramic node at all; it'd be querying data in Parquet that was exported out of Ceramic via the OLAP aggregator? TBH that feels like a weird thing to include in the Ceramic CLI, since at that point it kind of isn't really Ceramic data anymore?

nathanielc (Collaborator, Author) commented

> TBH that feels like a weird thing to include in the Ceramic CLI, since at that point it kind of isn't really Ceramic data anymore?

As more of the Ceramic data comes to live in Parquet, this fits naturally. Additionally, we are leaning towards bringing the OLAP aggregator into C1, so that distinction won't be meaningful to end users.

@nathanielc nathanielc marked this pull request as draft October 9, 2024 14:57
stbrody (Collaborator) commented Oct 9, 2024

I imagine that we're going to want a way for users to query the data that actually lives inside the Ceramic node, no? I would have assumed that that's what a query command in the Ceramic CLI would do. If we end up with both a way to query data within the Ceramic node and a way to query data that has been exported from Ceramic to Parquet in S3, then I just want to make sure it isn't confusing to users which is which. Maybe it's just a matter of thinking carefully about the names of the config flags for this command.

There are currently two tables in the datafusion _pipeline_
architecture: conclusion_feed and doc_state.

This change moves the datafusion initialization code that interacts
with those tables into a new ceramic-pipeline crate, where both the olap
and query commands can reuse the logic. This new crate can be a home for
future work on this _pipeline_ architecture as we flesh it out; for
example, we will likely move the aggregator/ceramic_patch logic into this crate.

In either case, the query command now knows how to query both tables when
it first starts up. Table access is via Flight SQL and direct
object store access.
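
For context on what that shared initialization amounts to, here is a minimal sketch assuming DataFusion's SessionContext and the object_store crate's S3 support; the real ceramic-pipeline code is not shown in this PR, and the conclusion_feed/Flight SQL side is left as a placeholder.

```rust
// Minimal sketch (not the actual ceramic-pipeline code): register the two
// pipeline tables on a DataFusion SessionContext. The doc_state table is read
// as Parquet straight from an S3-compatible object store; the conclusion_feed
// table would be backed by the daemon's Flight SQL endpoint, noted below only
// as a placeholder.
use std::sync::Arc;

use datafusion::error::DataFusionError;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use object_store::aws::AmazonS3Builder;
use url::Url;

async fn session_with_pipeline_tables(bucket: &str) -> Result<SessionContext, DataFusionError> {
    let ctx = SessionContext::new();

    // doc_state: Parquet files accessed directly in the object store.
    let store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .build()
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    let url = Url::parse(&format!("s3://{bucket}"))
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    ctx.register_object_store(&url, Arc::new(store));
    ctx.register_parquet(
        "doc_state",
        &format!("s3://{bucket}/doc_state/"),
        ParquetReadOptions::default(),
    )
    .await?;

    // conclusion_feed: served by the running ceramic-one daemon over Flight SQL.
    // The concrete TableProvider for that endpoint lives in the ceramic-pipeline
    // crate and is not reproduced in this sketch.

    Ok(ctx)
}
```
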
@nathanielc nathanielc marked this pull request as ready for review October 14, 2024 20:17
@nathanielc nathanielc requested a review from stbrody October 14, 2024 20:18
nathanielc (Collaborator, Author) commented

@stbrody The PR has been updated to auto-configure connecting to the conclusion_feed table over Flight SQL and to the doc_state table over S3.

@nathanielc nathanielc temporarily deployed to github-tests-2024 October 14, 2024 20:38 — with GitHub Actions Inactive
@nathanielc nathanielc temporarily deployed to github-tests-2024 October 15, 2024 15:22 — with GitHub Actions Inactive
AaronGoldman (Contributor) commented

I think we should have an HTTP API endpoint to run queries against your node so that all the data in the SQLite file and the data exported to Parquet files can be queried identically. By querying with C1 you should not need to understand the storage location, but obviously if I want to use DuckDB (bring your own query engine) then I will need to tell those tools where to find the state store.

stbrody (Collaborator) left a comment

LGTM, modulo questions about the argument names.

one/src/query.rs Outdated
#[arg(
    default_value = "http://127.0.0.1:5102",
    env = "CERAMIC_ONE_FLIGHT_SQL_ENDPOINT"
)]
flight_sql_endpoint: String,
Collaborator:

I feel like the fact that it uses Flight as the API format is an implementation detail; the important thing is that this is a C1 node at the other end of this endpoint, right? Should this be called ceramic_one_endpoint or something like that?

nathanielc (Collaborator, Author):

Yes; however, ceramic-one exposes multiple endpoints (p2p, REST, Flight SQL). Maybe we call it ceramic_one_query_endpoint, since its purpose is to expose query access to the data?

Collaborator:

Yep, I like that.

nathanielc (Collaborator, Author):

Using ceramic_one_query_endpoint meant the env var was CERAMIC_ONE_CERAMIC_ONE_QUERY_ENDPOINT, which is horrible. So I went with just query_endpoint, since the ceramic_one context is already present.

one/src/query.rs Outdated
/// * AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS
///
#[arg(long, env = "CERAMIC_ONE_AWS_BUCKET")]
aws_bucket: String,
Collaborator:

Kind of the inverse of my comment above: AWS is a specific platform, one we are notably not using. Should this be, say, s3_bucket or something like that? We're using the S3 API, but we aren't using the AWS platform...

nathanielc (Collaborator, Author):

I like s3_bucket; however, the env vars will still be AWS-prefixed, since that is the accepted standard for configuring S3 endpoint authentication.

Collaborator:

Sounds good.
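
Pulling the thread together, here is a hypothetical sketch of what the agreed-upon argument names could look like with clap; the env var names, defaults, and doc comments are assumptions extrapolated from the excerpts above, not the final contents of one/src/query.rs.

```rust
// Hypothetical sketch of the renamed arguments; env var names and docs are
// assumptions following the patterns discussed in this thread.
use clap::Parser;

#[derive(Parser, Debug)]
struct QueryOpts {
    /// Endpoint of a running ceramic-one daemon that exposes query access to
    /// the data (Flight SQL under the hood).
    #[arg(
        long,
        default_value = "http://127.0.0.1:5102",
        env = "CERAMIC_ONE_QUERY_ENDPOINT"
    )]
    query_endpoint: String,

    /// S3-compatible bucket holding the doc_state Parquet data. Credentials
    /// still come from the standard AWS_* env vars (AWS_ACCESS_KEY_ID,
    /// AWS_SECRET_ACCESS_KEY, AWS_ALLOW_HTTP, ...).
    #[arg(long, env = "CERAMIC_ONE_S3_BUCKET")]
    s3_bucket: String,
}
```
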

stbrody (Collaborator) commented Oct 15, 2024

> I think we should have an HTTP API endpoint to run queries against your node

Isn't that what the Flight SQL endpoint already is?

nathanielc (Collaborator, Author) commented

> I think we should have an HTTP API endpoint to run queries against your node

> Isn't that what the Flight SQL endpoint already is?

In my opinion it is. It's not an HTTP/1 endpoint, but trying to build an HTTP/1 endpoint to expose query access to the data would mean we basically reimplement Flight SQL.
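
As a rough illustration of treating the Flight SQL endpoint as the query API, here is a client-side sketch assuming arrow-flight's FlightSqlServiceClient (behind its flight-sql-experimental feature); exact method signatures vary across arrow-flight versions, so this is illustrative rather than the client ceramic-one ships.

```rust
// Rough client-side sketch: plan a query over Flight SQL against a running
// ceramic-one daemon and fetch the resulting record batches. The endpoint
// matches the default shown in the diff above.
use arrow_flight::sql::client::FlightSqlServiceClient;
use futures::TryStreamExt;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://127.0.0.1:5102")
        .connect()
        .await?;
    let mut client = FlightSqlServiceClient::new(channel);

    // Ask the server to plan the query; it answers with one or more endpoints,
    // each carrying a ticket that can be redeemed for the actual data stream.
    let info = client
        .execute("SELECT * FROM conclusion_feed LIMIT 10".to_string(), None)
        .await?;
    for endpoint in info.endpoint {
        if let Some(ticket) = endpoint.ticket {
            let batches: Vec<_> = client.do_get(ticket).await?.try_collect().await?;
            println!("fetched {} record batches", batches.len());
        }
    }
    Ok(())
}
```
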

@nathanielc nathanielc enabled auto-merge October 16, 2024 15:19
@nathanielc nathanielc temporarily deployed to github-tests-2024 October 16, 2024 15:39 — with GitHub Actions Inactive
@nathanielc nathanielc added this pull request to the merge queue Oct 16, 2024
Merged via the queue into main with commit 33f9ee9 Oct 16, 2024
5 checks passed
@nathanielc nathanielc deleted the feat/query branch October 16, 2024 16:22
@smrz2001 smrz2001 mentioned this pull request Oct 17, 2024