-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix actions #111
Fix actions #111
Conversation
3783f8c
to
06ccd55
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! You should be able to add the needed keys too.
.github/workflows/fetch-dataset.yml
Outdated
@@ -55,7 +54,6 @@ on: | |||
- qualiacomputing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nonarxiv_paper can be removed along with report. They've both been moved to either xml or pdfs.
qualiacomputing can also be removed since just Murphant helped us identify the relevant pages were added to special doc, just need metadata and get scraped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nonarxiv_paper was going through the xml files. I renamed it to be explicit
workflow_dispatch: # allow manual triggering | ||
schedule: | ||
- cron: "0 3 * * 0" # Every Sunday at 3 AM | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we pushing to HF? Or is HF pulling live data from MySQL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pushing to HF. The HF dataloader class thingy can pull from SQL, but the datasets on their page are static files which can be used by anyone. This step isn't really needed - it's just a nice bonus thing which might as well be left in
README.md
Outdated
@@ -20,8 +20,8 @@ The following list of sources may change and items may be renamed: | |||
- [deepmind_blog](https://deepmindsafetyresearch.medium.com/) | |||
- [distill](https://distill.pub/) | |||
- [eaforum](https://forum.effectivealtruism.org/) - selected posts | |||
- ebooks - books include [Superintelligence](https://www.goodreads.com/book/show/20527133-superintelligence), [Human Compatible](https://www.goodreads.com/book/show/44767248-human-compatible), [Life 3.0](https://www.goodreads.com/book/show/34272565-life-3-0), [The Precipice](https://www.goodreads.com/book/show/50485582-the-precipice), and others |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the copyright books back in? Maybe don't advertise this until we get permissions straightened out? If pinecone embeddings is now done in ARD instead of stampy-chat, we can just make sure the copyrighted material doesn't go to HF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dunno why this was added here...
|
||
|
||
class SummaryDataset(AlignmentDataset): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI importai and ml_safety_newsletter also function as summaries. It's probably buried in one of the issues somewhere.
Should we treat arxiv abstracts as a summary too? My thought is this would be the field we search when some asks for "the blog or paper about ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an explicit issue for them - #113
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abstracts should now be saved as summaries
article.authors = ','.join(article.authors[:1024].split(',')[:-1]) | ||
return article | ||
|
||
def make_data_entry(self, data, **kwargs) -> Article: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better! I would have expected _add_authors to be some Article method like set_authors, but this version has the advantage of being right next to the method that uses it so potentially easier to keep track of
Get all datasets working with the database: