Update print styles for docraptor and add python script for generating PDFs #339

Merged
merged 19 commits into develop from print-styles
Sep 20, 2023

rlskoeser
Contributor

@rlskoeser rlskoeser commented Sep 11, 2023

in this PR:

  • revised print styles to work with docraptor; PDFs have been generated from this branch and reviewed by Grant
  • new python script to generate PDFs using the docraptor API and docraptor python package
  • modes for single article PDF and for generating PDFs for all feature articles in an issue; IDK how far this gets us towards automate generating article PDFs #6 but it's a nice step that I think will save me time!
  • usage instructions at the top of the python script

question:

  • is documentation in the script sufficient, or should this be documented in the readme?

@rlskoeser rlskoeser marked this pull request as draft September 11, 2023 17:00
@gwijthoff gwijthoff added this to the Issue 4 milestone Sep 12, 2023
@rlskoeser rlskoeser changed the title from "Update print styles for docraptor (WIP)" to "Update print styles for docraptor and add python script for generating PDFs" Sep 19, 2023
@rlskoeser rlskoeser marked this pull request as ready for review September 19, 2023 21:10
@rlskoeser
Contributor Author

@quadrismegistus could you look over the new python script and see if it all makes sense / if you have any concerns? If you're up for testing it out and making sure my documentation is enough, that would be helpful too.

@quadrismegistus quadrismegistus left a comment

Looks great! very elegant

Comment on lines +5 to +24
To use:

- install python dependencies: `pip install docraptor bs4 requests`
- create a DocRaptor account and get an API key https://docraptor.com
- set your API key as environment variable DOCRAPTOR_API_KEY
- run the script with a Startwords article url and an output filename, e.g.
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o
startwords-3-mapping-latent-spaces.pdf`

The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
a test PDF is the default behavior for this script. When you are ready to
create a final PDF, use the `--no-test` flag to turn test mode off.

To generate PDFs for all the feature articles in one Startwords issue,
use the `--issue` flag and run with the issue url, e.g.,
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/ --issue`

PDFs will be named according to current Startwords PDF naming conventions.

"""

Nice, very clear documentation, and it's cool it accepts (already legible) URLs as input

Comment on lines +44 to +55
response = doc_api.create_doc(
    {
        # test documents are free but watermarked
        "test": test,
        "document_type": "pdf",
        "document_url": url,
        # "javascript": True,
        "prince_options": {
            "media": "print",  # @media 'screen' or 'print' CSS
        },
    }
)

So this just returns the raw binary for the output PDF? Nice

It also reads in the URL directly. Is it investigating the DOM in some way? Looking for paragraph tags etc?

Contributor Author

It's using the generated html + the print styles and then somehow applying them to create a paginated pdf that uses the paged media css spec.

There's an option to turn on javascript if you need it for rendering; I don't think we need it for any of our SW PDFs so keeping that disabled for now.
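
For anyone following along, here is a rough sketch of how the request and the file write might fit together. It is not the script's actual code: `build_doc_options` is a hypothetical helper added for illustration, and the API key handling is an assumption based on the usage notes (the `DOCRAPTOR_API_KEY` environment variable). The `DocApi` / `create_doc` calls follow the pattern shown in the docraptor package README.

```python
import os


def build_doc_options(url, test=True, javascript=False):
    """Build the options dict passed to create_doc (hypothetical helper)."""
    return {
        "test": test,  # test documents are free but watermarked
        "document_type": "pdf",
        "document_url": url,  # DocRaptor fetches and renders this page itself
        "javascript": javascript,  # off unless the page needs JS to render
        "prince_options": {"media": "print"},  # apply the @media print styles
    }


def create_pdf(url, output_filename, test=True):
    """Request a PDF for `url` and write the returned bytes to disk."""
    # imported here so build_doc_options can be used without the package
    import docraptor

    api_key = os.environ.get("DOCRAPTOR_API_KEY")
    if not api_key:
        raise SystemExit("Set DOCRAPTOR_API_KEY before running")

    doc_api = docraptor.DocApi()
    doc_api.api_client.configuration.username = api_key
    response = doc_api.create_doc(build_doc_options(url, test=test))
    # the response is the raw PDF content, so write it in binary mode
    with open(output_filename, "wb") as pdf_file:
        pdf_file.write(bytearray(response))
```

Keeping the options in a separate function makes it easy to check what will be sent without spending a (non-test) document credit.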

Comment on lines +76 to +80
# find the links for the articles
# limit to links under /issues/ (no author links, DOIs, etc)
article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
# snippets are under features; get a snippet link list to exclude them
snippets = issue.css.select('.snippets a[href^="/issues/"]')

Nice -- we put an id="features" in the HTML there for this or similar purposes?

Contributor Author

oh, I guess my comment is a little unclear! all the articles are under the features section, it's already structured that way in the html; this was the simplest way I could find to get the articles and exclude other links

link["href"].strip("/").split("/")[1:]
)
print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
create_pdf(article_url, output_filename, test=test)

Nice, so this just calls the other function for each article
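
To make the per-article step concrete, here is a small sketch of how an article link could be turned into a URL and an output filename. The `strip("/").split("/")[1:]` expression comes from the snippet above; the `startwords-` prefix and `.pdf` suffix are assumptions inferred from the example filename in the usage notes, and `pdf_filename` and `article_jobs` are hypothetical names, not functions from the script.

```python
def pdf_filename(href, prefix="startwords"):
    """Derive a PDF filename from a relative article link."""
    # "/issues/3/mapping-latent-spaces/" -> ["3", "mapping-latent-spaces"]
    parts = href.strip("/").split("/")[1:]
    return "-".join([prefix, *parts]) + ".pdf"


def article_jobs(hrefs, base_url="https://startwords.cdh.princeton.edu"):
    """Pair each full article URL with its output filename."""
    return [(base_url + href, pdf_filename(href)) for href in hrefs]


print(pdf_filename("/issues/3/mapping-latent-spaces/"))
# startwords-3-mapping-latent-spaces.pdf
```

Each (url, filename) pair would then be handed to the single-article function, which is the loop described above.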

Comment on lines +99 to +113
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate a PDF from a url",
    )
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        action="store_false",  # turn test mode off when this parameter is specified
        default=True,  # testing is on by default
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument("--issue", action="store_true")

    args = parser.parse_args()

I don't know if you've used click before, it simplifies argparsing, but it's another dependency and this works fine
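
Staying with argparse, here is a quick sketch of how a `store_false` flag behaves, since the inverted logic is easy to misread. The `dest="test"` argument and the boolean default are assumptions made for this illustration (they let the parsed value read as `args.test`); the script itself may name and default things differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Generate a PDF from a url")
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        dest="test",  # assumption: store the value under a positive name
        action="store_false",  # passing --no-test sets test to False
        default=True,  # test mode stays on unless the flag is given
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument("--issue", action="store_true")
    return parser


parser = build_parser()
print(parser.parse_args(["https://example.com/"]).test)  # True
print(parser.parse_args(["https://example.com/", "--no-test"]).test)  # False
```

With this shape, the value can be passed straight through as the `test` option without any negation at the call site.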

@@ -13,32 +13,31 @@
   }

   @top-right {
-    content: element(page-header);
+    content: flow(page-header);

what does flow do?

Contributor Author

It's part of the paged media CSS spec - it takes an element out of the regular document and inserts it into a space for a running header or footer.

Here's the docraptor documentation for it, which is a little more readable & targeted than the full paged media spec https://docraptor.com/documentation/article/1067094-headers-footers

I'm still not sure why I had to switch from element to flow but this is what they have in their documentation and it worked the way I wanted. (We use this to put the SW logo in the top right corner, DOI in the bottom left, etc.)

@rlskoeser
Contributor Author

Since Grant has reviewed and signed off on the PDFs generated by this new code and now that Ryan has reviewed the python code, I'm going to get this merged into develop so we can start working with the new print styles and PDF generation as we work on finishing the remaining figures.

@rlskoeser rlskoeser merged commit 927e9e4 into develop Sep 20, 2023
2 checks passed
@rlskoeser rlskoeser deleted the print-styles branch September 20, 2023 13:49