Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DataChain.listings() method and use it in getting storages #331

Merged
merged 28 commits into from
Sep 5, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Aug 20, 2024

This PR adds class method similar to DataChain.datasets() called DataChain.listings() which returns list of ListingInfo objects from chain.
ListingInfo describes one specific cached listing (version) and is simply sub-classing DatasetInfo which is returned from mentioned DataChain.datasets() as listing is just special form of dataset.
The idea is to use this instead of deprecated Storage class which will be, along side with related codebase, removed in upcoming followups.

Additional changes:

  1. Using Dataset.listings() in CLI method ls_local which lists storages / listings
  2. Removed couple of unused Catalog methods: unlist_source, storage_stats, ls_storage_uris, get_storage, ls_storages and underlying metastore methods
  3. Fixing dataset dependencies as now there is no special "storage" table which can be references and storage type dependency is also pointing to dataset (just listing type of dataset)

Blocked by #294

@ilongin ilongin marked this pull request as draft August 20, 2024 23:40
@ilongin ilongin changed the base branch from main to ilongin/266-refactor-from-storage August 20, 2024 23:41
@ilongin ilongin linked an issue Aug 20, 2024 that may be closed by this pull request
Copy link

cloudflare-workers-and-pages bot commented Aug 20, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 2d7fb35
Status: ✅  Deploy successful!
Preview URL: https://d9141fbd.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-329-refactor-storage.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 92.06349% with 5 lines in your changes missing coverage. Please review.

Project coverage is 86.81%. Comparing base (2952eb1) to head (2d7fb35).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/dataset.py 84.21% 1 Missing and 2 partials ⚠️
src/datachain/lib/listing_info.py 91.30% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #331      +/-   ##
==========================================
+ Coverage   86.78%   86.81%   +0.03%     
==========================================
  Files          92       93       +1     
  Lines       10069    10078       +9     
  Branches     2047     2048       +1     
==========================================
+ Hits         8738     8749      +11     
+ Misses        988      984       -4     
- Partials      343      345       +2     
Flag Coverage Δ
datachain 86.74% <92.06%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Base automatically changed from ilongin/266-refactor-from-storage to main September 2, 2024 09:21
@ilongin ilongin force-pushed the ilongin/329-refactor-storages branch from 2e125ae to 8537347 Compare September 2, 2024 12:52
@ilongin ilongin marked this pull request as ready for review September 2, 2024 13:21
dependency_name = dataset_name

if is_listing_dataset(dataset_name):
dependency_type = DatasetDependencyType.STORAGE # type: ignore[arg-type]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use LISTING as a type (since we use listing as term in some other places)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this as well. Not sure, but both is correct IMO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but both is correct IMO

yes, but it just make it simpler to maintain, search, understand

not critical, but I would try to go everywhere with the same term

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will consider that in upcoming PRs just to not block this one

object_name: str = "listing",
**kwargs,
) -> "DataChain":
"""Generate chain with list of cached listing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listing -> listings

Q: do we really need to make it a public dc method? (I don't have any strong opinion on this - just thinking if we can for now expose less and see if there are scenarios for this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean we will need from_listing btw? (just trying to think if we can keep at all about datasets / datachain and avoid exposing "listing" concept to end users (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we do have usage of both listings and storages words around interface which is ok IMO. Listing is just listed storage so I don't think those two words cannot live together.
I don't think we need from_listing as we have from_storage which in background creates new listing if doesn't exist and uses it then.

I think it's ok for us to expose this as we will be using it in Studio for example which can also be seen as one client.

object_name: str = "listing",
**kwargs,
) -> "DataChain":
"""Generate chain with list of cached listing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great to explain a bit what the listing is - just a few words

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one sentence

src/datachain/lib/dc.py Outdated Show resolved Hide resolved
Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, my only question if we really want to expose this as method that returns datachain (what is the use case for this?)

also the name listing vs storage that we use in some places

@ilongin
Copy link
Contributor Author

ilongin commented Sep 4, 2024

LGTM, my only question if we really want to expose this as method that returns datachain (what is the use case for this?)

also the name listing vs storage that we use in some places

About exposing it as a method that returns DataChain instance, I wanted to be consistent with similar .datasets() method. If we want to change this, both methods needs to be refactored together

@ilongin ilongin merged commit 576b69a into main Sep 5, 2024
38 checks passed
@ilongin ilongin deleted the ilongin/329-refactor-storages branch September 5, 2024 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor Catalog storages methods to use new listing mechanism
2 participants