Update print styles for docraptor and add python script for generating PDFs #339
Your Render PR Server URL is https://startwords-dev-pr-339.onrender.com. Follow its progress at https://dashboard.render.com/static/srv-cjvkf5p5mpss73c6mku0.
@quadrismegistus could you look over the new Python script and see if it all makes sense, or if you have any concerns? If you're up for testing it out and making sure my documentation is enough, that would be helpful too.
Looks great! Very elegant.
To use:

- install python dependencies: `pip install docraptor bs4 requests`
- create a DocRaptor account and get an API key https://docraptor.com
- set your API key as environment variable DOCRAPTOR_API_KEY
- run the script with a Startwords article url and an output filename, e.g.
  `create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o
  startwords-3-mapping-latent-spaces.pdf`

The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
a test PDF is the default behavior for this script. When you are ready to
create a final PDF, use the `--no-test` flag to turn test mode off.

To generate PDFs for all the feature articles in one Startwords issue,
use the `--issue` flag and run with the issue url, e.g.,
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/ --issue`

PDFs will be named according to current Startwords PDF naming conventions.

"""
Nice, very clear documentation, and it's cool it accepts (already legible) URLs as input
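For anyone testing the script, the API key step in the usage notes could be handled roughly like this. It is only a sketch: `get_api_key` is a hypothetical helper, not a function from this PR, and only the `DOCRAPTOR_API_KEY` variable name comes from the docstring.

```python
import os


def get_api_key(env=os.environ):
    """Return the DocRaptor API key from the environment, or exit with a hint.

    Hypothetical helper for illustration; the script in this PR may
    read the key differently.
    """
    api_key = env.get("DOCRAPTOR_API_KEY")
    if not api_key:
        # Fail early with a readable message instead of a failed API call
        raise SystemExit("Set DOCRAPTOR_API_KEY to your DocRaptor API key")
    return api_key
```

Passing the environment in as a parameter keeps the helper easy to check without touching the real shell environment.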
response = doc_api.create_doc(
    {
        # test documents are free but watermarked
        "test": test,
        "document_type": "pdf",
        "document_url": url,
        # "javascript": True,
        "prince_options": {
            "media": "print",  # @media 'screen' or 'print' CSS
        },
    }
)
So this just returns the raw binary for the output PDF? Nice
It also reads in the URL directly. Is it investigating the DOM in some way? Looking for paragraph tags etc?
It's using the generated HTML plus the print styles and then somehow applying them to create a paginated PDF that uses the paged media CSS spec.
There's an option to turn on javascript if you need it for rendering; I don't think we need it for any of our SW PDFs so keeping that disabled for now.
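To make the exchange above concrete, here is a rough sketch of the request and of saving the returned binary. `build_doc_request`, `save_pdf`, and `create_pdf_sketch` are hypothetical names for illustration, not the functions in this PR. The client setup (`docraptor.DocApi()` and setting `api_client.configuration.username`) follows DocRaptor's published Python example as I understand it, so treat it as an assumption to check against their documentation.

```python
def build_doc_request(url, test=True, javascript=False):
    """Build the options dict passed to DocRaptor (hypothetical helper)."""
    request = {
        # test documents are free but watermarked
        "test": test,
        "document_type": "pdf",
        # DocRaptor fetches and renders this URL itself
        "document_url": url,
        # render with the site's @media print styles
        "prince_options": {"media": "print"},
    }
    if javascript:
        # only needed when the page relies on JavaScript to render
        request["javascript"] = True
    return request


def save_pdf(pdf_bytes, output_filename):
    """Write the raw PDF bytes returned by the API to disk."""
    with open(output_filename, "wb") as pdf_file:
        pdf_file.write(bytearray(pdf_bytes))


def create_pdf_sketch(url, output_filename, api_key, test=True):
    """Request a PDF and save it; needs the docraptor package and network access."""
    import docraptor  # third-party; imported here so the helpers above run without it

    doc_api = docraptor.DocApi()
    # Assumption: the API key is supplied as the client's username
    doc_api.api_client.configuration.username = api_key
    response = doc_api.create_doc(build_doc_request(url, test=test))
    save_pdf(response, output_filename)
```

Keeping the payload in its own function means the options can be inspected without making an API call.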
# find the links for the articles
# limit to links under /issues/ (no author links, DOIs, etc)
article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
# snippets are under features; get a snippet link list to exclude them
snippets = issue.css.select('.snippets a[href^="/issues/"]')
Nice -- we put an id="features" in the HTML there for this or similar purposes?
oh, I guess my comment is a little unclear! all the articles are under the features section, it's already structured that way in the html; this was the simplest way I could find to get the articles and exclude other links
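The exclusion step described here can be sketched with plain lists of hrefs. This is not the code from the PR: `exclude_snippets` is a hypothetical helper, the snippet path in the usage lines is made up, and skipping repeated links is an added assumption (an article may be linked more than once in the features section).

```python
def exclude_snippets(article_hrefs, snippet_hrefs):
    """Return feature article hrefs in page order, without snippets or repeats."""
    snippet_set = set(snippet_hrefs)
    seen = set()
    articles = []
    for href in article_hrefs:
        # skip snippet links and any link already collected
        if href in snippet_set or href in seen:
            continue
        seen.add(href)
        articles.append(href)
    return articles


# Illustrative hrefs; the snippet path is invented for the example
features = [
    "/issues/3/mapping-latent-spaces/",
    "/issues/3/mapping-latent-spaces/",
    "/issues/3/example-snippet/",
]
snippets = ["/issues/3/example-snippet/"]
print(exclude_snippets(features, snippets))
```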
    link["href"].strip("/").split("/")[1:]
)
print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
create_pdf(article_url, output_filename, test=test)
Nice, so this just calls the other function for each article
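As a rough illustration of how each article link could map to an output filename: the `strip`/`split` expression is the one visible in the diff, while the `startwords-` prefix and the hyphen join are assumptions inferred from the example filename in the docstring. `pdf_filename` is a hypothetical name.

```python
def pdf_filename(href):
    """Derive a PDF filename from an article path such as /issues/3/slug/."""
    # drop the leading "issues" segment, keeping issue number and slug
    parts = href.strip("/").split("/")[1:]
    # Assumed naming convention: startwords-<issue>-<slug>.pdf
    return "startwords-" + "-".join(parts) + ".pdf"


print(pdf_filename("/issues/3/mapping-latent-spaces/"))
# startwords-3-mapping-latent-spaces.pdf
```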
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate a PDF from a url",
    )
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        action="store_false",  # turn test mode off when this parameter is specified
        default=True,  # testing is true by default
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument("--issue", action="store_true")

    args = parser.parse_args()
I don't know if you've used `click` before; it simplifies argument parsing, but it's another dependency and this works fine.
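For reference while reviewing, this is how a `store_false` flag with a boolean default behaves in plain argparse. The parser mirrors the arguments in the diff; `build_parser` and the `out.pdf` filename are just for the sketch. Note that the parsed attribute is named `no_test` (argparse derives it from the flag), so `args.no_test` being `True` means test mode is on.

```python
import argparse


def build_parser():
    """Rebuild the argument parser from the diff for experimentation."""
    parser = argparse.ArgumentParser(description="Generate a PDF from a url")
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        action="store_false",  # stores False when the flag is given
        default=True,  # test mode stays on unless the flag is given
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument("--issue", action="store_true")
    return parser


url = "https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/"
default_args = build_parser().parse_args([url])
final_args = build_parser().parse_args([url, "--no-test", "-o", "out.pdf"])

print(default_args.no_test)  # True: watermarked test PDF
print(final_args.no_test)  # False: real PDF
```

If the inverted attribute name reads awkwardly, argparse's `dest="test"` option would store the value as `args.test` instead; that is a possible variation, not what the diff does.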
@@ -13,32 +13,31 @@
 }

 @top-right {
-  content: element(page-header);
+  content: flow(page-header);
What does `flow` do?
It's part of the paged media CSS spec - it takes an element out of the regular document and inserts it into a space for a running header or footer.
Here's the docraptor documentation for it, which is a little more readable & targeted than the full paged media spec https://docraptor.com/documentation/article/1067094-headers-footers
I'm still not sure why I had to switch from `element` to `flow`, but this is what they have in their documentation and it worked the way I wanted. (We use this to put the SW logo in the top right corner, DOI in the bottom left, etc.)
Since Grant has reviewed and signed off on the PDFs generated by this new code, and now that Ryan has reviewed the Python code, I'm going to get this merged into develop so we can start working with the new print styles and PDF generation as we work on finishing the remaining figures.