-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add interface for NoSQL storage #214
base: master
Are you sure you want to change the base?
Conversation
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
…outs across platforms
…loading Minio-specific code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About 2/3 through the files - publishing some comments now
"python", | ||
"nodejs" | ||
], | ||
"modules": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we will list resource 'modules' here e.g. nosql, storage, queues. How would this be adapted in the case of applications - where each function might have different requirements - such that the code that processes this json and creates said resources can do so as seamlessly as possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca That is an excellent question on which I'm working right now. At this moment, I think the best option would be to make the json a collection of configuration points, one general (benchmark's language), and the other with per-function config.
|
||
# Set initial data | ||
|
||
nosql_func( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: perhaps this could be renamed to something more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca Good catch, will fix! :-)
@@ -3,6 +3,14 @@ | |||
sys.path.append(os.path.join(os.path.dirname(__file__), '.python_packages/lib/site-packages')) | |||
|
|||
|
|||
if 'NOSQL_STORAGE_DATABASE' in os.environ: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this I was slightly misled into thinking that this must be set by user, though I see it is actually set in gcp.py. Perhaps a comment along the lines of "this env var is set in this function in this file" would help clarify
self._region = region | ||
self._cloud_resources = resources | ||
|
||
# Map benchmark -> orig_name -> table_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this?
pass | ||
|
||
@abstractmethod | ||
def writer_func( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about write
or write_to_table
?
(3) Update cached data if anything new was created | ||
""" | ||
|
||
def create_benchmark_tables( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The singular form (create_benchmark_table
) would probably be more appropriate.
I am even wondering whether we could only have create_table
as a method and skip create_benchmark_table
altogether -- or is there a reason why this is done like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main idea was to force a single behavior on all systems: check cache, create a name, and create a table if necessary.
self.client = session.client( | ||
"dynamodb", | ||
region_name=region, | ||
aws_access_key_id=access_key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't use these in my implementation - is there something here that requires them? Not a strong opinion either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca What do you mean here? The access keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I found that this will still work without them, but it's a small thing anyway 🙂
aws_secret_access_key=secret_key, | ||
) | ||
|
||
# Map benchmark -> orig_name -> table_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarifying what orig_name
means here would be useful - perhaps you could give an example to show exactly how the table name maps to the original name (I see below that it's quite a construction)
raise NotImplementedError() | ||
|
||
def remove_table(self, name: str) -> str: | ||
raise NotImplementedError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you still plan to use these/have already implemented them, but feel free to take inspiration from my branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca Thanks! I will try to cherry-pick the commits or copy the code :-)
of buckets. Buckets may be created or retrieved from cache. | ||
|
||
:param benchmark: benchmark name | ||
:param buckets: tuple of required input/output buckets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually missing from the function signature/implementation. Perhaps the implementation is still incomplete?
Also, do we plan a similar approach for storage? To pass some parameters here and have everything immediately initialised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca Good catch, these are old comments. We changed the implementation :-)
for original_name, actual_name in nosql_storage.get_tables( | ||
code_package.benchmark | ||
).items(): | ||
envs[f"NOSQL_STORAGE_TABLE_{original_name}"] = actual_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are env vars the best approach in this situation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca For object storage buckets, we had to pass their names as part of the input. I think that using envs is a better choice, as we don't have to adjust the input of each invocation.
Do you see a better way of doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other cloud resources might go in this file (considering the pretty generic name)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca I will add docs here - I simply separated the implementation from config, as with CosmosDB account the implementation of Azure resources became quite large.
Requires Azure CLI instance in Docker to obtain storage account details. | ||
|
||
:param benchmark: | ||
:param buckets: number of input and output buckets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing here too
""" | ||
|
||
self.logging.info(f"Allocating a new Firestore database {database_name}") | ||
self._cli_instance.execute( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason why we are using the CLI to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca AFAIK there's no API way to allocate a Firestore database except of using gcloud
tool. Did you find it somewhere? It would greatly simplify the implementation :-)
|
||
return envs | ||
|
||
def _generate_function_envs(self, code_package: Benchmark) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other env vars do we think this function will have to support in the future (so apart from NoSQL-related)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oanarosca If we use Redis for something, we also need to pass Reddis access details.
We want to add NoSQL storage and a test benchmark implementing simple CRUD API - emulate a shopping cart, like on Amazon.