
feat: add query command to ceramic-one #545

Merged: 5 commits from feat/query into main, Oct 16, 2024
Conversation

nathanielc (Collaborator)

No description provided.

@nathanielc nathanielc requested review from stbrody and a team as code owners September 30, 2024 22:51
@nathanielc nathanielc requested review from dav1do and removed request for a team September 30, 2024 22:51
stbrody (Collaborator) left a comment

Is this going over HTTP or via direct access to the underlying data store (SQLite)? I believe it's the latter; if so, can it run while another C1 process is running in daemon mode against the same data?

I feel like a query REPL would be best pointed at a running C1 daemon over the network using the Flight APIs, rather than at the local data files directly.

nathanielc (Collaborator, Author) commented Oct 8, 2024

> Is this going over HTTP or via direct access to the underlying data store (SQLite)?

Good question, it's neither. As written today it cannot access the data. When I demoed it last week I pointed it at my S3 bucket and queried the data that way.

To me that means we should add a bit of configuration for your S3 bucket as part of the query command so that it's ready to go. In that configuration it will not compete with the SQLite file or any other local resource.

I'll change this to a draft until I can get to that change. Either way, I'll wait to merge until it has a bit more utility than just a disconnected REPL.

stbrody (Collaborator) commented Oct 9, 2024

> To me that means we should add a bit of configuration for your S3 bucket as part of the query command so that it's ready to go

Oh interesting, so then this wouldn't really be querying data in a Ceramic node at all; it'd be querying data in Parquet that was exported out of Ceramic via the OLAP aggregator? TBH that feels like a weird thing to include in the Ceramic CLI, since at that point it kind of isn't really Ceramic data anymore?

nathanielc (Collaborator, Author) commented

> TBH that feels like a weird thing to include in the Ceramic CLI, since at that point it kind of isn't really Ceramic data anymore?

As more of the Ceramic data comes to live in Parquet, this fits naturally. Additionally, we are leaning towards bringing the OLAP aggregator into C1, so that distinction won't be meaningful to end users.

@nathanielc nathanielc marked this pull request as draft October 9, 2024 14:57
stbrody (Collaborator) commented Oct 9, 2024

I imagine that we're going to want a way for users to query the data that actually lives inside the Ceramic node, no? I would have assumed that that's what a query command in the Ceramic CLI would do. If we end up with both a way to query data within the Ceramic node and a way to query data that has been exported from Ceramic to Parquet in S3, then I just want to make sure it isn't confusing to users which is which. Maybe it's just a matter of thinking carefully about the names of the config flags for this command.

There are currently two tables in the datafusion _pipeline_
architecture: conclusion_feed and doc_state.

This change moves the datafusion initialization code that interacts
with those tables into a new ceramic-pipeline crate, where both the olap
and query commands can reuse the logic. This new crate can be a home for
future work on this _pipeline_ architecture as we flesh it out; for
example, we will likely move the aggregator/ceramic_patch logic into this crate.

In either case, the query command now knows how to query both tables when
it first starts up. Table access is via Flight SQL and direct
object store access.
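
For context on what that shared initialization amounts to, here is a minimal sketch assuming DataFusion's SessionContext and the object_store crate's S3 support; the real ceramic-pipeline code is not shown in this PR, and the conclusion_feed/Flight SQL side is left as a placeholder.

```rust
// Minimal sketch (not the actual ceramic-pipeline code): register the two
// pipeline tables on a DataFusion SessionContext. The doc_state table is read
// as Parquet straight from an S3-compatible object store; the conclusion_feed
// table would be backed by the daemon's Flight SQL endpoint, noted below only
// as a placeholder.
use std::sync::Arc;

use datafusion::error::DataFusionError;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use object_store::aws::AmazonS3Builder;
use url::Url;

async fn session_with_pipeline_tables(bucket: &str) -> Result<SessionContext, DataFusionError> {
    let ctx = SessionContext::new();

    // doc_state: Parquet files accessed directly in the object store.
    let store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .build()
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    let url = Url::parse(&format!("s3://{bucket}"))
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    ctx.register_object_store(&url, Arc::new(store));
    ctx.register_parquet(
        "doc_state",
        &format!("s3://{bucket}/doc_state/"),
        ParquetReadOptions::default(),
    )
    .await?;

    // conclusion_feed: served by the running ceramic-one daemon over Flight SQL.
    // The concrete TableProvider for that endpoint lives in the ceramic-pipeline
    // crate and is not reproduced in this sketch.

    Ok(ctx)
}
```
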
@nathanielc nathanielc marked this pull request as ready for review October 14, 2024 20:17
@nathanielc nathanielc requested a review from stbrody October 14, 2024 20:18
nathanielc (Collaborator, Author) commented

@stbrody The PR has been updated to auto-configure connecting to the conclusion_feed table over Flight SQL and to the doc_state table over S3.

@nathanielc nathanielc temporarily deployed to github-tests-2024 October 14, 2024 20:38 — with GitHub Actions Inactive
@nathanielc nathanielc temporarily deployed to github-tests-2024 October 15, 2024 15:22 — with GitHub Actions Inactive
AaronGoldman (Contributor) commented

I think we should have an HTTP API endpoint to run queries against your node so that all the data in the SQLite file and the data exported to Parquet files can be queried identically. By querying with C1 you should not need to understand the storage location, but obviously if I want to use DuckDB (bring your own query engine) then I will need to tell those tools where to find the state store.

stbrody (Collaborator) left a comment

LGTM, modulo questions about the argument names.

one/src/query.rs Outdated
#[arg(
    default_value = "http://127.0.0.1:5102",
    env = "CERAMIC_ONE_FLIGHT_SQL_ENDPOINT"
)]
flight_sql_endpoint: String,
Collaborator:

I feel like the fact that it uses Flight as the API format is an implementation detail; the important thing is that this is a C1 node at the other end of this endpoint, right? Should this be called ceramic_one_endpoint or something like that?

nathanielc (Collaborator, Author):

Yes; however, ceramic-one exposes multiple endpoints (p2p, REST, Flight SQL). Maybe we call it ceramic_one_query_endpoint, since its purpose is to expose query access to the data?

Collaborator:

Yep, I like that.

nathanielc (Collaborator, Author):

Using ceramic_one_query_endpoint meant the env var was CERAMIC_ONE_CERAMIC_ONE_QUERY_ENDPOINT, which is horrible. So I went with just query_endpoint, since the ceramic_one context is already present.

one/src/query.rs Outdated
/// * AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS
///
#[arg(long, env = "CERAMIC_ONE_AWS_BUCKET")]
aws_bucket: String,
Collaborator:

Kind of the inverse of my comment above: AWS is a specific platform, one we are notably not using. Should this be, say, s3_bucket or something like that? We're using the S3 API, but we aren't using the AWS platform...

nathanielc (Collaborator, Author):

I like s3_bucket; however, the env vars will still be AWS-prefixed, since that is the accepted standard for configuring S3 endpoint authentication.

Collaborator:

Sounds good.
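
Pulling the thread together, here is a hypothetical sketch of what the agreed-upon argument names could look like with clap; the env var names, defaults, and doc comments are assumptions extrapolated from the excerpts above, not the final contents of one/src/query.rs.

```rust
// Hypothetical sketch of the renamed arguments; env var names and docs are
// assumptions following the patterns discussed in this thread.
use clap::Parser;

#[derive(Parser, Debug)]
struct QueryOpts {
    /// Endpoint of a running ceramic-one daemon that exposes query access to
    /// the data (Flight SQL under the hood).
    #[arg(
        long,
        default_value = "http://127.0.0.1:5102",
        env = "CERAMIC_ONE_QUERY_ENDPOINT"
    )]
    query_endpoint: String,

    /// S3-compatible bucket holding the doc_state Parquet data. Credentials
    /// still come from the standard AWS_* env vars (AWS_ACCESS_KEY_ID,
    /// AWS_SECRET_ACCESS_KEY, AWS_ALLOW_HTTP, ...).
    #[arg(long, env = "CERAMIC_ONE_S3_BUCKET")]
    s3_bucket: String,
}
```
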

stbrody (Collaborator) commented Oct 15, 2024

> I think we should have an HTTP API endpoint to run queries against your node

Isn't that what the Flight SQL endpoint already is?

nathanielc (Collaborator, Author) commented

> I think we should have an HTTP API endpoint to run queries against your node

> Isn't that what the Flight SQL endpoint already is?

In my opinion it is. It's not an HTTP/1 endpoint, but trying to build an HTTP/1 endpoint to expose query access to the data would mean we basically reimplement Flight SQL.
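
As a rough illustration of treating the Flight SQL endpoint as the query API, here is a client-side sketch assuming arrow-flight's FlightSqlServiceClient (behind its flight-sql-experimental feature); exact method signatures vary across arrow-flight versions, so this is illustrative rather than the client ceramic-one ships.

```rust
// Rough client-side sketch: plan a query over Flight SQL against a running
// ceramic-one daemon and fetch the resulting record batches. The endpoint
// matches the default shown in the diff above.
use arrow_flight::sql::client::FlightSqlServiceClient;
use futures::TryStreamExt;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://127.0.0.1:5102")
        .connect()
        .await?;
    let mut client = FlightSqlServiceClient::new(channel);

    // Ask the server to plan the query; it answers with one or more endpoints,
    // each carrying a ticket that can be redeemed for the actual data stream.
    let info = client
        .execute("SELECT * FROM conclusion_feed LIMIT 10".to_string(), None)
        .await?;
    for endpoint in info.endpoint {
        if let Some(ticket) = endpoint.ticket {
            let batches: Vec<_> = client.do_get(ticket).await?.try_collect().await?;
            println!("fetched {} record batches", batches.len());
        }
    }
    Ok(())
}
```
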

@nathanielc nathanielc enabled auto-merge October 16, 2024 15:19
@nathanielc nathanielc temporarily deployed to github-tests-2024 October 16, 2024 15:39 — with GitHub Actions Inactive
@nathanielc nathanielc added this pull request to the merge queue Oct 16, 2024
Merged via the queue into main with commit 33f9ee9 Oct 16, 2024
5 checks passed
@nathanielc nathanielc deleted the feat/query branch October 16, 2024 16:22
@smrz2001 smrz2001 mentioned this pull request Oct 17, 2024