Requirements for SODA Crystal - Add your pain points OR use cases OR feature requirements #9

Open
skdwriting opened this issue Nov 30, 2023 · 9 comments

Comments

@skdwriting
Collaborator

skdwriting commented Nov 30, 2023

Issue/Feature Description:
You can add all your pain points, use cases, or feature requirements for the SODA Crystal project.
Project focus: unstructured metadata management

You can add all your inputs in the comments. We will brainstorm to produce the first list.

Reference
You can refer to some of the basic information collected or prepared for SODA Crystal here

Example 1:
We are struggling with metadata search for S3, especially its performance. Please find the details of our concern here OR attach the information.

Example 2:
I found a project which handles unstructured metadata management. Can we take inputs from there?

Example 3:
Feature request: Could storing huge amounts of IoT data in a common format be a good feature? <More info here - link can be added or attach>

Example 4:
Do we consider data lake pain points like ?

@skdwriting
Collaborator Author

Adding the input from @thatsdone here #8

@thatsdone

(Possible) Feature Request

Add OCI (Oracle Cloud Infrastructure) Object Storage Service support.

Since I think we need to discuss the support matrix with Strato, I filed an issue there too (sodafoundation/strato#1425).

@skdwriting
Collaborator Author

Comments from Rakesh, IBM:

@skdwriting
Collaborator Author

Comments from Rakesh, IBM (SODA TOC): Managing Apache Parquet files for large data sets, including their schemas, is challenging. Lakehouse solutions (Apache Iceberg, Apache Hudi) solve this. However, a specific solution from us would be useful to the many organizations that are not able to deploy a lakehouse-style solution.

@dinkar--s

From the competitive analysis, it looks like there could be two different focus areas for Crystal - (1) intelligent search of data sets and (2) data management using metadata. These two would have different requirements.

Searching inside unstructured data could be an area to explore (Subhankar)

@dinkar--s

We should keep track of old metadata for objects as well as new metadata. This will allow us to answer queries such as: (1) At what rate is unstructured data growing? (2) Which are the fastest-growing datasets? (3) Which department is adding data at the fastest rate? (4) What should the backup strategy for a dataset be, based on how fast it is growing?
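
To make the growth queries concrete, here is a minimal sketch of how retained metadata snapshots could answer queries (1) to (3). Everything in it is an assumption for illustration: the record layout `(snapshot_date, dataset, department, total_bytes)`, the sample values, and the `growth_per_day` helper are hypothetical and not part of Crystal.

```python
from collections import defaultdict
from datetime import date

# Hypothetical snapshot records: (snapshot_date, dataset, department, total_bytes).
# A real system would load these from wherever old metadata is retained.
snapshots = [
    (date(2024, 1, 1), "logs", "ops", 100),
    (date(2024, 1, 31), "logs", "ops", 400),
    (date(2024, 1, 1), "images", "ml", 1000),
    (date(2024, 1, 31), "images", "ml", 1150),
]


def growth_per_day(records, key_index):
    """Return bytes/day growth per key between its earliest and latest snapshot."""
    # Sum sizes per (key, date) so several datasets can roll up into one department.
    totals = defaultdict(lambda: defaultdict(int))
    for rec in records:
        totals[rec[key_index]][rec[0]] += rec[3]

    rates = {}
    for key, by_date in totals.items():
        first, last = min(by_date), max(by_date)
        days = (last - first).days
        if days > 0:  # a single snapshot gives no growth information
            rates[key] = (by_date[last] - by_date[first]) / days
    return rates


dataset_rates = growth_per_day(snapshots, key_index=1)     # query (2)
department_rates = growth_per_day(snapshots, key_index=2)  # query (3)
overall_rate = sum(dataset_rates.values())                 # query (1)
fastest_dataset = max(dataset_rates, key=dataset_rates.get)
```

Query (4) would then be a policy decision layered on top of these rates, for example backing up fast-growing datasets more often; that mapping is not sketched here.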

@rakeshJn

Hi, I want to add one use case here:
For foundation model training, a lot of data is collected and is usually stored in a cloud object store (COS). The data is normally in Parquet files. If a lakehouse-style solution is not used, these files are very difficult to manage. Some issues are:

  1. Get the list of all the datasets we have (for example, a bunch of Parquet files in a specific folder represents a dataset).
  2. When was each dataset last updated?
  3. What is the size of each dataset, and how many files does it contain? What is the folder hierarchy?
  4. What is the name and size of each file, and what is the size distribution (how many files fall in the size ranges 0-100MB, 100-200MB, and so on)?
  5. If we want, we can go further: at folder level, get the schema of a Parquet file (perhaps only the first file) and show all the fields available in it.
  6. There is a lot more information available in the Parquet metadata, but it is specific to each file, and retrieving and storing it might be expensive.
    Once all this information is available, one should be able to search by folder name (which represents the dataset name), etc.
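
Items 1 to 4 above can be sketched against a plain object listing. This is only an illustration under stated assumptions: the listing tuples `(key, size_bytes, last_modified)`, the sample keys, and the helpers `summarize_datasets` and `search_datasets` are hypothetical, and a folder is treated as a dataset if it directly contains `.parquet` files.

```python
import posixpath
from datetime import datetime

MB = 1024 * 1024

# Hypothetical object listing: (key, size_bytes, last_modified).
# In practice this would come from the object store's list API.
objects = [
    ("corpus/web/part-0.parquet", 50 * MB, datetime(2024, 1, 5)),
    ("corpus/web/part-1.parquet", 150 * MB, datetime(2024, 2, 1)),
    ("corpus/code/part-0.parquet", 120 * MB, datetime(2023, 12, 1)),
    ("corpus/code/README.txt", 1024, datetime(2023, 12, 2)),
]


def summarize_datasets(listing, bucket_mb=100):
    """Group Parquet objects by parent folder and collect per-dataset stats."""
    summary = {}
    for key, size, modified in listing:
        if not key.endswith(".parquet"):
            continue  # only Parquet files count towards a dataset
        folder = posixpath.dirname(key)
        entry = summary.setdefault(
            folder,
            {"files": 0, "bytes": 0, "last_updated": modified, "size_histogram": {}},
        )
        entry["files"] += 1
        entry["bytes"] += size
        entry["last_updated"] = max(entry["last_updated"], modified)
        # Half-open buckets: a file of exactly 100MB lands in "100-200MB".
        low = (size // (bucket_mb * MB)) * bucket_mb
        label = f"{low}-{low + bucket_mb}MB"
        entry["size_histogram"][label] = entry["size_histogram"].get(label, 0) + 1
    return summary


def search_datasets(summary, term):
    """Return dataset (folder) names containing the search term."""
    return sorted(name for name in summary if term in name)


inventory = summarize_datasets(objects)
matches = search_datasets(inventory, "web")
```

Item 5 (reading a schema from one file per folder) would need a Parquet library and is left out so that the sketch stays standard-library only.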

@dinkar--s

dinkar--s commented Oct 1, 2024

To provide compelling features against the competition, Crystal should be extensible to support semantic understanding of the metadata

@dinkar--s

To provide compelling features against the competition, Crystal should be extensible to support multiple query languages including natural language


4 participants