added embeddings/gen example #362

Merged
13 commits merged into main from consolidation
Sep 18, 2024

Conversation

tibor-mach
Contributor

No description provided.

cloudflare-workers-and-pages bot commented Aug 27, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: c6d1c1e
Status: ✅  Deploy successful!
Preview URL: https://89fcb377.datachain-documentation.pages.dev
Branch Preview URL: https://consolidation.datachain-documentation.pages.dev

@tibor-mach
Contributor Author

Related to #353

codecov bot commented Aug 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.69%. Comparing base (80f4fbe) to head (c6d1c1e).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #362   +/-   ##
=======================================
  Coverage   86.69%   86.69%           
=======================================
  Files          92       92           
  Lines        9789     9789           
  Branches     2019     2019           
=======================================
  Hits         8487     8487           
  Misses        946      946           
  Partials      356      356           
Flag Coverage Δ
datachain 86.63% <ø> (ø)

dc = (
    DataChain.from_storage(source)
    .settings(parallel=-1)
    .filter(C.file.path.glob("*.pdf"))
Member

should we put some limit here, at least for tests?

Member

could parameterise the source variable (as per examples/multimodal/wds.py) to only ingest a subset of files in the test
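
A minimal sketch of what that parameterisation could look like, assuming the example reads its settings from environment variables. The variable names `EXAMPLE_SOURCE` and `EXAMPLE_LIMIT`, the helper names, and the placeholder bucket URI are all hypothetical and do not come from this PR or from examples/multimodal/wds.py.

```python
import os

# Hypothetical placeholder URI; the real bucket path is not stated in this PR.
DEFAULT_SOURCE = "s3://example-bucket/papers/"


def resolve_source(env=None):
    """Return the storage URI to ingest, overridable via EXAMPLE_SOURCE."""
    env = os.environ if env is None else env
    return env.get("EXAMPLE_SOURCE", DEFAULT_SOURCE)


def resolve_limit(env=None):
    """Return an optional row cap as an int, or None when no cap is set."""
    env = os.environ if env is None else env
    raw = env.get("EXAMPLE_LIMIT", "").strip()
    return int(raw) if raw else None
```

The test run could then export a narrower source (for example a single year's prefix) and a small limit, while the example keeps its full default for normal use. The resolved values would be passed to `DataChain.from_storage(source)` and, if a cap is set, to a row limit on the chain; that wiring is left out here so the sketch runs without datachain installed.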

Contributor Author

It only ingested what was in neurips, but I have now reduced it further to papers from 1987.

Member

I think there is a bug in the Windows CI where parallel mode breaks. I've put up #424 as an attempt to fix the issue.

@mattseddon
Member

Looks like the example won't run on Windows. It fails with "ImportError: DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed." You'll need to poke around in unstructured to see which version of onnx it uses and whether it can be downgraded/pinned to fix the issue. This issue is likely the cause.
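
One way to check whether a given environment hits the same failure is to attempt the import in isolation and report the error text. This is only a sketch: the helper name `import_error` is hypothetical, and it says nothing about which onnx version is safe to pin.

```python
import importlib


def import_error(module_name):
    """Return the import failure message for module_name, or None if it imports."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # A DLL load failure on Windows is raised as ImportError as well.
        return f"{type(exc).__name__}: {exc}"
    return None
```

Calling `import_error("onnx")` before and after changing the pinned version would show whether the DLL error goes away, without running the whole example.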

@shcheklein
Member

@tibor-mach is this one done? any progress on it? what are the blockers?

@mattseddon
Member

> @tibor-mach is this one done? any progress on it? what are the blockers?

The new example needs to be parameterized, as it takes more than 1 hour to run on Windows.

@tibor-mach
Contributor Author

@mattseddon How would you go about that?

> The new example needs to be parameterized, as it takes more than 1 hour to run on Windows.

I'm not quite sure why this is the case. I downgraded the onnx version as per your suggestion here. I have now limited the size of the dataset even further, so let's see. It wasn't overly large before, though, and it should not take nearly that long if the example is to be of any real use on Windows.

mattseddon merged commit 6001589 into main on Sep 18, 2024
38 checks passed
mattseddon deleted the consolidation branch on September 18, 2024 07:09