added embeddings/gen example #362
for more information, see https://pre-commit.ci
Related to #353
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #362   +/-   ##
=======================================
  Coverage   86.69%   86.69%
=======================================
  Files          92       92
  Lines        9789     9789
  Branches     2019     2019
=======================================
  Hits         8487     8487
  Misses        946      946
  Partials      356      356

Flags with carried forward coverage won't be shown.
dc = (
    DataChain.from_storage(source)
    .settings(parallel=-1)
    .filter(C.file.path.glob("*.pdf"))
Should we put some limit here, at least for tests?
Could parameterise the source variable (as per examples/multimodal/wds.py) to only ingest a subset of files in the test.
It only ingested what was in neurips, but I have now reduced it further to papers from 1987.
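For reference, a minimal sketch of what parameterising the source could look like, so that a test can point the example at a smaller subset. The environment variable name `EXAMPLE_SOURCE`, the helper names, and the `from datachain import C, DataChain` import path are assumptions made for illustration; this is not necessarily how examples/multimodal/wds.py does it.

```python
import os


def resolve_source(default: str, env_var: str = "EXAMPLE_SOURCE") -> str:
    """Return the storage URI to ingest.

    The environment variable name is hypothetical. If it is set to a
    non-empty value it overrides the default, which lets a test run the
    example against a smaller prefix.
    """
    return os.environ.get(env_var) or default


def build_chain(source: str):
    """Build the chain shown in the diff for the given source."""
    # Imported lazily so resolve_source() works without datachain installed.
    # The import path is an assumption about the package layout.
    from datachain import C, DataChain

    return (
        DataChain.from_storage(source)
        .settings(parallel=-1)
        .filter(C.file.path.glob("*.pdf"))
    )
```

A test could then set the variable to a narrower prefix (for example, only the 1987 papers) before calling the example, while a normal run keeps the full default source.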
I think there is a bug in the Windows CI where parallel mode breaks. I've put up #424 as an attempt to fix the issue.
Looks like the example won't run on Windows due to |
@tibor-mach is this one done? Any progress on it? What are the blockers?
The new example needs to be parameterized, as it takes more than an hour to run on Windows.
@mattseddon How would you go about that?
I'm not quite sure why this is the case. I downgraded the onnx version as per your suggestion. I have now limited the size of the dataset even more, so let's see (though it wasn't overly huge before; it should not take nearly as much time if it is to be of any real use on Windows).
No description provided.