Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: expose all tables via flight sql #571

Merged
merged 5 commits into from
Nov 6, 2024
Merged

Conversation

nathanielc
Copy link
Collaborator

@nathanielc nathanielc commented Oct 28, 2024

This change moves the OLAP aggregation logic within ceramic-one. This means that all tables are available via Flight SQL endpoint.

TODO:

  • Manually test query command

@nathanielc nathanielc requested review from stbrody and a team as code owners October 28, 2024 21:40
@nathanielc nathanielc requested review from smrz2001 and removed request for a team October 28, 2024 21:40
@nathanielc nathanielc force-pushed the feat/single-binary-olap branch from 9e30ee6 to 3deb641 Compare October 28, 2024 21:41
@stbrody
Copy link
Collaborator

stbrody commented Oct 29, 2024

Going to wait to review until CI is green

@nathanielc nathanielc temporarily deployed to github-tests-2024 October 30, 2024 21:40 — with GitHub Actions Inactive
@nathanielc
Copy link
Collaborator Author

@stbrody Ready for review now. Turns out there was a subtle issue with the federation used with the query command. So instead of trying to fix it I have changed the query command to be a simple client of the FlightSQL endpoint instead of full featured REPL

@nathanielc nathanielc force-pushed the feat/single-binary-olap branch from 2f6e6c6 to 2cf7a61 Compare November 5, 2024 20:15
@nathanielc nathanielc temporarily deployed to github-tests-2024 November 5, 2024 20:34 — with GitHub Actions Inactive
Copy link
Collaborator

@stbrody stbrody left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few small comments but generally LGTM.

One question: where does this leave folks who want to build their own custom aggregators? Can they still do so out-of-process by consuming the conclusion feed over the network? Does this PR make this more difficult?

requires = "flight_sql_bind_address",
env = "CERAMIC_ONE_AGGREGATOR"
)]
aggregator: Option<bool>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to think if this should be a state enum instead of a boolean, but I'm having a hard time thinking about what other modes the aggregator might have in the future. I guess since I can't think of anything than just a bool is probably fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is also to remove this boolean once we make it always on. We will likely need some form of config for which parts of the pipeline to enable. That will likely be a net new config option once we have move than one stage.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the name of the flag be enable_aggregator maybe?

#[arg(
long,
requires = "experimental_features",
env = "CERAMIC_ONE_S3_BUCKET"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the future the data files might live in a different bucket than the aggregator output right? So we might want this name to indicate that this specifically for the result of the aggregated state that we are exporting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am anticipating a single bucket for all the data. A single bucket and handle many tables. I don't want to complicate the API by expecting a different bucket for each table until that need arises in product discussions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment to reflect this intention

aggregator: Option<bool>,

/// S3 bucket name.
/// When configured the aggregator will support storing data in S3 compatible object stores.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will support? or will actually do so?

@@ -469,7 +497,7 @@ pub async fn run(opts: DaemonOpts) -> Result<()> {
.await
.map_err(|e| {
anyhow!(
"Failed to start libp2p server using addresses: {}. {}",
"Failed to start linebp2p server using addresses: {}. {}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

/// Enable the aggregator, requires Flight SQL to be enabled
#[arg(
long,
requires = "flight_sql_bind_address",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also requires the s3 bucket?

let _ = shutdown_signal.recv().await;
let bucket = opts
.s3_bucket
.ok_or_else(|| anyhow!("s3_bucket option is required when exposing flight sql"))?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be required? What if you just want the run the aggregation locally and query out the aggregated state feed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today its required as there is no mechanism for storing the state locally. We should likely add that for dev purposes but that is a separate issue.

@nathanielc
Copy link
Collaborator Author

One question: where does this leave folks who want to build their own custom aggregators? Can they still do so out-of-process by consuming the conclusion feed over the network? Does this PR make this more difficult?

This does not make it more difficult. Consumers can still build their own aggregators out of band. In fact this makes is easier as all the data for any stage in the pipeline can be found in a single location.

This change moves the OLAP aggregation logic within ceramic-one. This
means that all tables are available via Flight SQL endpoint.
The REPL query command using federation had many gotchas that were hard
to predict. For example using a count query would confuse the federation
optimizer rules. This changes the query command to be a plain (non REPL)
client of the FlightSQL endpoint.
@nathanielc nathanielc force-pushed the feat/single-binary-olap branch from 6b18092 to e90b766 Compare November 6, 2024 21:05
@stbrody
Copy link
Collaborator

stbrody commented Nov 6, 2024

This does not make it more difficult. Consumers can still build their own aggregators out of band. In fact this makes is easier as all the data for any stage in the pipeline can be found in a single location.

This does remove all the tooling for building an external validator though right? Now someone would need to not just make their specific aggregator, but also a build process for a standalone aggregator binary

@nathanielc
Copy link
Collaborator Author

This does remove all the tooling for building an external validator though right? Now someone would need to not just make their specific aggregator, but also a build process for a standalone aggregator binary

That was true before however they could have copied the olap binary. But it was quite small and didn't have anything interesting to preserve. It was a blind copy of the ceramic-one binary setup.

@nathanielc nathanielc added this pull request to the merge queue Nov 6, 2024
Merged via the queue into main with commit 7817be9 Nov 6, 2024
5 checks passed
@nathanielc nathanielc deleted the feat/single-binary-olap branch November 6, 2024 22:06
This was referenced Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants