Requirements for SODA Crystal - Add your pain points OR use cases OR feature requirements #9

skdwriting · 2023-11-30T03:48:58Z

Issue/Feature Description:
You can add all your pain points, use cases, or feature requirements for the SODA Crystal project.
Project focus: unstructured metadata management

Please add all your inputs in the comments. We will brainstorm to produce the first list.

Reference
You can refer to some of the basic information collected or prepared for SODA Crystal here

Example 1:
We are struggling with metadata search for S3, especially its performance. Please find the details of our concern here OR attach the information.

Example 2:
I found a project which handles unstructured metadata management. Can we take inputs from there?

Example 3:
Feature request: Could storing huge amounts of IoT data in a common format be a good feature? <More info here - link can be added or attach>

Example 4:
Do we consider data lake pain points like ?

skdwriting · 2023-12-01T01:51:20Z

Adding the input from @thatsdone here #8

thatsdone · 2023-12-05T00:52:23Z

(Possible) Feature Request

Add OCI (Oracle Cloud Infrastructure) Object Storage Service support.

Since I think we need to discuss the support matrix with Strato, I also filed an issue there (sodafoundation/strato#1425).

skdwriting · 2024-02-23T07:45:06Z

Comments from Rakesh, IBM:

skdwriting · 2024-02-23T07:46:20Z

Comments from Rakesh, IBM (SODA TOC): Managing Apache Parquet files for large data sets, including their schemas, is challenging. Lakehouse solutions (Apache Iceberg, Apache Hudi) solve this. However, if we can provide a specific solution, it will be useful to the many organizations that are not able to deploy a lakehouse kind of solution.

dinkar--s · 2024-03-18T11:28:33Z

From the competitive analysis, it looks like there could be two different focus areas for Crystal - (1) intelligent search of data sets and (2) data management using metadata. These two would have different requirements.

Searching inside unstructured data could be an area to explore (Subhankar)

dinkar--s · 2024-09-19T11:53:16Z

We should keep track of old metadata for objects as well as new metadata. This will allow us to answer queries such as: (1) At what rate is unstructured data growing? (2) Which are the fastest-growing datasets? (3) Which department is adding data at the fastest rate? (4) What should the backup strategy for a dataset be, based on how fast it is growing?
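
As a rough illustration of how retained metadata history could answer queries (1) to (3), here is a minimal sketch. The `Snapshot` record, its fields, and the `growth_per_day` helper are hypothetical names invented for this example; they are not part of Crystal. It assumes each snapshot stores a dataset's total size on a given date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    """Point-in-time size of one dataset (hypothetical schema)."""
    dataset: str
    department: str
    taken_on: date
    total_bytes: int


def growth_per_day(snapshots, key=lambda s: s.dataset):
    """Return {group: bytes added per day} between each group's
    earliest and latest snapshot dates."""
    totals = defaultdict(lambda: defaultdict(int))  # group -> date -> bytes
    for s in snapshots:
        totals[key(s)][s.taken_on] += s.total_bytes
    rates = {}
    for group, by_date in totals.items():
        first, last = min(by_date), max(by_date)
        days = (last - first).days
        if days > 0:  # a rate needs at least two distinct dates
            rates[group] = (by_date[last] - by_date[first]) / days
    return rates


# Made-up sample data, for illustration only.
history = [
    Snapshot("logs", "ops", date(2024, 1, 1), 100),
    Snapshot("logs", "ops", date(2024, 1, 11), 300),
    Snapshot("images", "ml", date(2024, 1, 1), 1000),
    Snapshot("images", "ml", date(2024, 1, 11), 1100),
]

by_dataset = growth_per_day(history)
by_department = growth_per_day(history, key=lambda s: s.department)
fastest_dataset = max(by_dataset, key=by_dataset.get)
```

Grouping by a caller-supplied key lets the same history answer both the per-dataset and the per-department question. Query (4) would need a policy on top of these rates, which is not sketched here.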

rakeshJn · 2024-09-30T19:54:23Z

Hi, I want to add one use case here:
For foundation model training, a lot of data is collected and is usually stored in a cloud object store (COS). The data is normally in Parquet files. If a lakehouse kind of solution is not used, these files are very difficult to manage. Some issues are:

Get the list of all the datasets we have (for example, a bunch of Parquet files in a specific folder represents a dataset).
When was each dataset last updated?
What is the size of each dataset, and how many files does it contain? What is the folder hierarchy?
What is the name and size of each file, and what is the size distribution (how many files fall in the size ranges 0-100MB, 100-200MB, and so on)?
If we want, we can go further: at the folder level, get the schema of a Parquet file (perhaps only the first file) and show all the fields available in it.
There is a lot more information available in Parquet metadata, but it is specific to each file, and retrieving and storing it might be extensive.
Once all this information is available, one should be able to search by folder name (which represents the dataset name) and so on.
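
The folder-level questions above can be sketched from a plain object listing. This is a minimal sketch under stated assumptions: the listing is already available as `(key, size_bytes, last_modified)` tuples, the parent folder of each `.parquet` object is taken as the dataset name, and `DatasetSummary` and `summarize` are hypothetical names for this example only. Reading the Parquet schema itself is left out.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional

MB = 1024 * 1024
BUCKET_BYTES = 100 * MB  # histogram bucket width: 0-100MB, 100-200MB, ...


@dataclass
class DatasetSummary:
    """Aggregated metadata for one folder of Parquet files (hypothetical)."""
    file_count: int = 0
    total_bytes: int = 0
    last_updated: Optional[datetime] = None
    size_histogram: Counter = field(default_factory=Counter)


def summarize(objects):
    """Group .parquet objects by parent folder and aggregate their metadata.

    objects: iterable of (key, size_bytes, last_modified) tuples.
    """
    datasets = defaultdict(DatasetSummary)
    for key, size, modified in objects:
        if not key.endswith(".parquet"):
            continue  # ignore non-Parquet objects
        summary = datasets[str(PurePosixPath(key).parent)]
        summary.file_count += 1
        summary.total_bytes += size
        if summary.last_updated is None or modified > summary.last_updated:
            summary.last_updated = modified
        # A file of exactly 100MB lands in the 100-200MB bucket.
        low = (size // BUCKET_BYTES) * 100
        summary.size_histogram[f"{low}-{low + 100}MB"] += 1
    return dict(datasets)


# Made-up listing, for illustration only.
listing = [
    ("lake/train/a.parquet", 50 * MB, datetime(2024, 9, 1)),
    ("lake/train/b.parquet", 150 * MB, datetime(2024, 9, 20)),
    ("lake/eval/c.parquet", 10 * MB, datetime(2024, 8, 1)),
    ("lake/eval/README.txt", 1, datetime(2024, 9, 30)),
]
summaries = summarize(listing)
matches = sorted(name for name in summaries if "train" in name)
```

Searching by dataset name is shown here as a simple substring match over the folder names. A real deployment would more likely persist the summaries in an index rather than recompute them from a full listing.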

dinkar--s · 2024-10-01T07:23:29Z

To provide compelling features against the competition, Crystal should be extensible to support semantic understanding of the metadata.

dinkar--s · 2024-10-01T07:24:46Z

To provide compelling features against the competition, Crystal should be extensible to support multiple query languages, including natural language.
