Fix actions #111

mruwnik · 2023-08-03T21:16:43Z

Get all datasets working with the database:

- alignment_newsletter creates summaries
- reports is removed, as it's now a duplicate of other ones
- actions cleaned up

ccstan99

Looks good! You should be able to add the needed keys too.

ccstan99 · 2023-08-04T07:31:19Z

.github/workflows/fetch-dataset.yml

@@ -55,7 +54,6 @@ on:
          - qualiacomputing


nonarxiv_paper can be removed along with report. They've both been moved to either xml or pdfs.

qualiacomputing can also be removed, since Murphant helped us identify the relevant pages, which were added to the special doc; they just need metadata and then get scraped.

nonarxiv_paper was going through the xml files. I renamed it to be explicit

ccstan99 · 2023-08-04T07:33:50Z

.github/workflows/upload-to-huggingface.yml

+  workflow_dispatch: # allow manual triggering
+  schedule:
+    - cron: "0 3 * * 0"  # Every Sunday at 3 AM
+


Are we pushing to HF? Or is HF pulling live data from MySQL?

pushing to HF. The HF dataloader class thingy can pull from SQL, but the datasets on their page are static files which can be used by anyone. This step isn't really needed - it's just a nice bonus thing which might as well be left in
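For reference, the cron expression in the diff above (`0 3 * * 0`) fires at 03:00 on Sundays, and GitHub Actions evaluates schedules in UTC. Here is a minimal sketch of when the next run would land, assuming that schedule; the helper name `next_weekly_run` is hypothetical and not part of the repo:

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime, weekday: int = 6, hour: int = 3) -> datetime:
    """Return the first weekly slot strictly after `now`.

    Mirrors the cron expression "0 3 * * 0" by default. `weekday` follows
    Python's convention (Monday=0 ... Sunday=6), not cron's (Sunday=0).
    """
    # Start from today's slot at the scheduled hour.
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the scheduled weekday (0 days if already on it).
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    # If that slot is not in the future, the next run is a week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# Friday 2023-08-04 10:00 UTC, roughly when this thread was written.
now = datetime(2023, 8, 4, 10, 0, tzinfo=timezone.utc)
print(next_weekly_run(now).isoformat())
```

Scheduled workflows on GitHub can start later than the nominal time under load, so the computed value is a lower bound rather than an exact start time.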

ccstan99 · 2023-08-04T07:36:05Z

README.md

@@ -20,8 +20,8 @@ The following list of sources may change and items may be renamed:
 - [deepmind_blog](https://deepmindsafetyresearch.medium.com/)
 - [distill](https://distill.pub/)
 - [eaforum](https://forum.effectivealtruism.org/) - selected posts
+- ebooks - books include [Superintelligence](https://www.goodreads.com/book/show/20527133-superintelligence), [Human Compatible](https://www.goodreads.com/book/show/44767248-human-compatible), [Life 3.0](https://www.goodreads.com/book/show/34272565-life-3-0), [The Precipice](https://www.goodreads.com/book/show/50485582-the-precipice), and others


Are the copyrighted books back in? Maybe don't advertise this until we get permissions straightened out? If the Pinecone embeddings are now done in ARD instead of stampy-chat, we can just make sure the copyrighted material doesn't go to HF?

dunno why this was added here...

ccstan99 · 2023-08-04T07:43:59Z

align_data/common/alignment_dataset.py

+
+
+class SummaryDataset(AlignmentDataset):
+


FYI importai and ml_safety_newsletter also function as summaries. It's probably buried in one of the issues somewhere.

Should we treat arxiv abstracts as a summary too? My thought is this would be the field we search when someone asks for "the blog or paper about ..."

I did an explicit issue for them - #113

abstracts should now be saved as summaries

mruwnik · 2023-08-04T10:22:56Z

Both the daily and weekly syncs worked, so now all the datasets should be autoingested :D

Thomas-Lemoine · 2023-08-07T07:02:41Z

align_data/common/alignment_dataset.py

+            article.authors = ','.join(article.authors[:1024].split(',')[:-1])
+        return article
+
+    def make_data_entry(self, data, **kwargs) -> Article:


This is much better! I would have expected _add_authors to be an Article method like set_authors, but this version has the advantage of sitting right next to the method that uses it, so it's potentially easier to keep track of.
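To make the truncation expression in the diff above concrete, here is a minimal standalone sketch of what it does to a comma-separated author string. The function name `truncate_authors` and the length guard are assumptions for illustration; the diff only shows the expression itself, not the surrounding condition.

```python
def truncate_authors(authors: str, limit: int = 1024) -> str:
    """Cut a comma-separated author string down to at most `limit` characters.

    Assumption: the length guard is hypothetical; only the slicing
    expression appears in the diff.
    """
    if len(authors) <= limit:
        return authors
    # Slice to the limit, then drop the last piece, which may be a name
    # that the slice cut in half. A complete final name is dropped as well.
    return ",".join(authors[:limit].split(",")[:-1])


# 200 synthetic names, well over 1024 characters in total.
authors = ",".join(f"Author {i}" for i in range(200))
shortened = truncate_authors(authors)
print(len(authors), len(shortened))
```

One edge case worth keeping in mind: a single name longer than the limit with no comma in it comes back as an empty string, because the only piece is the one that gets dropped.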

mruwnik requested review from ccstan99, henri123lemoine and Thomas-Lemoine August 3, 2023 21:16

mruwnik force-pushed the fix-actions branch 3 times, most recently from 3783f8c to 06ccd55 Compare August 3, 2023 22:05

ccstan99 previously approved these changes Aug 4, 2023

ccstan99 reviewed Aug 4, 2023

mruwnik added 7 commits August 4, 2023 11:53

Remove unused datasets

9343cba

remove reports

fabd83a

remove GdocsDataset

b12596e

alignment newsletter

7786b2c

update actions names

7a4c49d

weekly HF sync

d3461fc

PR changes

ac983fc

mruwnik dismissed ccstan99’s stale review via ac983fc August 4, 2023 10:19

mruwnik force-pushed the fix-actions branch from dc5df5e to ac983fc Compare August 4, 2023 10:19

mruwnik requested a review from ccstan99 August 4, 2023 10:23

Thomas-Lemoine reviewed Aug 7, 2023

Thomas-Lemoine approved these changes Aug 7, 2023

mruwnik merged commit ebf3481 into main Aug 7, 2023

mruwnik deleted the fix-actions branch August 7, 2023 19:08
