
Fix actions #111

Merged
mruwnik merged 7 commits into main from fix-actions on Aug 7, 2023
Conversation

@mruwnik (Collaborator) commented Aug 3, 2023

Get all datasets working with the database:

  • alignment_newsletter creates summaries
  • reports is removed, as it now duplicates other datasets
  • actions are cleaned up

@mruwnik force-pushed the fix-actions branch 3 times, most recently from 3783f8c to 06ccd55, on August 3, 2023 at 22:05
ccstan99 previously approved these changes Aug 4, 2023

@ccstan99 (Collaborator) left a comment


Looks good! You should be able to add the needed keys too.

@@ -55,7 +54,6 @@ on:
- qualiacomputing
Collaborator


nonarxiv_paper can be removed along with report. They've both been moved to either xml or pdfs.

qualiacomputing can also be removed: Murphant helped us identify the relevant pages, which were added to the special docs, so they just need metadata and to get scraped.

Collaborator Author


nonarxiv_paper was going through the xml files. I renamed it to be explicit.

workflow_dispatch: # allow manual triggering
schedule:
- cron: "0 3 * * 0" # Every Sunday at 3 AM

Collaborator


Are we pushing to HF? Or is HF pulling live data from MySQL?

Collaborator Author


Pushing to HF. The HF dataset loader class can pull from SQL, but the datasets on their page are static files that anyone can use. This step isn't strictly needed; it's just a nice bonus that might as well be left in.
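For context, a push step along these lines might look roughly like the sketch below, assuming the standard `datasets` library; the record fields and repo id are placeholders rather than the project's actual names.

from datasets import Dataset

# Placeholder records; the real pipeline would read rows from the MySQL database.
records = [
    {"title": "Example post", "url": "https://example.com/post", "text": "..."},
]

ds = Dataset.from_list(records)

# Publish a static snapshot on the Hugging Face hub so anyone can use it
# without access to the database. The repo id below is a placeholder.
ds.push_to_hub("some-org/alignment-research-dataset")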

README.md Outdated
@@ -20,8 +20,8 @@ The following list of sources may change and items may be renamed:
- [deepmind_blog](https://deepmindsafetyresearch.medium.com/)
- [distill](https://distill.pub/)
- [eaforum](https://forum.effectivealtruism.org/) - selected posts
- ebooks - books include [Superintelligence](https://www.goodreads.com/book/show/20527133-superintelligence), [Human Compatible](https://www.goodreads.com/book/show/44767248-human-compatible), [Life 3.0](https://www.goodreads.com/book/show/34272565-life-3-0), [The Precipice](https://www.goodreads.com/book/show/50485582-the-precipice), and others
Collaborator


Are the copyrighted books back in? Maybe don't advertise this until we get the permissions straightened out? If the Pinecone embeddings are now done in ARD instead of stampy-chat, can we just make sure the copyrighted material doesn't go to HF?

Collaborator Author


dunno why this was added here...



class SummaryDataset(AlignmentDataset):

Collaborator


FYI, importai and ml_safety_newsletter also function as summaries. It's probably buried in one of the issues somewhere.

Should we treat arxiv abstracts as summaries too? My thought is this would be the field we search when someone asks for "the blog or paper about ..."

Collaborator Author


I opened an explicit issue for them: #113

Collaborator Author


Abstracts should now be saved as summaries.
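As a rough illustration of the idea, saving an abstract as a summary might look something like the following; the Article and Summary classes and their fields here are guesses, not the repo's actual models.

from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for the repo's models; names and fields are guesses.
@dataclass
class Summary:
    text: str
    source: str

@dataclass
class Article:
    title: str
    summaries: List[Summary] = field(default_factory=list)

def add_abstract_as_summary(article: Article, metadata: dict) -> Article:
    # Treat the arXiv abstract like any other summary, tagged with its source,
    # so it can be searched alongside newsletter-style summaries.
    abstract = metadata.get("abstract")
    if abstract:
        article.summaries.append(Summary(text=abstract, source="arxiv"))
    return article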

@mruwnik (Collaborator, Author) commented Aug 4, 2023

Both the daily and weekly syncs worked, so now all the datasets should be auto-ingested :D

@mruwnik requested a review from ccstan99 on August 4, 2023 at 10:23
article.authors = ','.join(article.authors[:1024].split(',')[:-1])
return article

def make_data_entry(self, data, **kwargs) -> Article:
Collaborator


This is much better! I would have expected _add_authors to be an Article method like set_authors, but this version has the advantage of being right next to the method that uses it, so it's potentially easier to keep track of.
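For comparison, the set_authors variant might look something like this; the Article shape and the 1024-character limit mirror the snippet above, but the method itself is only an illustrative assumption, not code from the PR.

class Article:
    # Hypothetical sketch of the alternative; not the repo's actual model.
    authors: str = ""

    def set_authors(self, authors: str, max_len: int = 1024) -> None:
        # Keep the comma-separated author string within the column limit,
        # dropping the trailing (possibly truncated) name when it overflows.
        if len(authors) > max_len:
            authors = ",".join(authors[:max_len].split(",")[:-1])
        self.authors = authors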

@mruwnik merged commit ebf3481 into main on Aug 7, 2023
@mruwnik deleted the fix-actions branch on August 7, 2023 at 19:08