Update print styles for docraptor and add python script for generating PDFs #339

rlskoeser · 2023-09-11T17:00:06Z

In this PR:

- revised print styles to work with docraptor; PDFs have been generated from this branch and reviewed by Grant
- new python script to generate PDFs using the docraptor API and docraptor python package
- modes for a single article PDF and for generating PDFs for all feature articles in an issue; I don't know how far this gets us towards "automate generating article PDFs" (#6), but it's a nice step that I think will save me time!
- usage instructions at the top of the python script

question:

is documentation in the script sufficient, or should this be documented in the readme?

render · 2023-09-11T17:00:08Z

Your Render PR Server URL is https://startwords-dev-pr-339.onrender.com.

Follow its progress at https://dashboard.render.com/static/srv-cjvkf5p5mpss73c6mku0.

rlskoeser · 2023-09-19T21:21:06Z

@quadrismegistus could you look over the new python script and see if it all makes sense / if you have any concerns? If you're up for testing it out and making sure my documentation is enough, that would be helpful too.

quadrismegistus

Looks great! very elegant

quadrismegistus · 2023-09-20T13:30:10Z

scripts/create_pdf.py

+To use:
+
+- install python dependencies: `pip install docraptor bs4 requests`
+- create a DocRaptor account and get an API key https://docraptor.com
+- set your API key as environment variable DOCRAPTOR_API_KEY
+- run the script with a Startwords article url and an output filename, e.g.
+  `create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o
+  startwords-3-mapping-latent-spaces.pdf`
+
+The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
+a test PDF is the default behavior for this script. When you are ready to
+create a final PDF, use the `--no-test` flag to turn test mode off.
+
+To generate PDFs for all the feature articles in one Startwords issue,
+use the `--issue` flag and run with the issue url, e.g.,
+`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/ --issue`
+
+PDFs will be named according to current Startwords PDF naming conventions.
+
+"""


Nice, very clear documentation, and it's cool it accepts (already legible) URLs as input
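The API key step in the usage notes could be handled along these lines. This is a minimal sketch, not the script's actual code: the helper name `get_api_key` and the error message are hypothetical, and only the `DOCRAPTOR_API_KEY` variable name comes from the docstring.

```python
import os


def get_api_key(environ=None):
    """Return the DocRaptor API key, or exit with a readable message."""
    # environ can be passed in for testing; defaults to the real environment
    environ = os.environ if environ is None else environ
    api_key = environ.get("DOCRAPTOR_API_KEY", "").strip()
    if not api_key:
        # hypothetical message; fail early instead of at request time
        raise SystemExit("Please set DOCRAPTOR_API_KEY before running this script")
    return api_key
```

Failing before any request is made keeps a missing key from surfacing as a less obvious authentication error later on.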

quadrismegistus · 2023-09-20T13:31:08Z

scripts/create_pdf.py

+        response = doc_api.create_doc(
+            {
+                # test documents are free but watermarked
+                "test": test,
+                "document_type": "pdf",
+                "document_url": url,
+                # "javascript": True,
+                "prince_options": {
+                    "media": "print",  # @media 'screen' or 'print' CSS
+                },
+            }
+        )


So this just returns the raw binary for the output PDF? Nice

It also reads in the URL directly. Is it investigating the DOM in some way? Looking for paragraph tags etc?

rlskoeser

It's using the generated HTML plus the print styles, and then somehow applying them to create a paginated PDF that uses the paged media CSS spec.

There's an option to turn on javascript if you need it for rendering; I don't think we need it for any of our SW PDFs so keeping that disabled for now.
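For anyone following this thread, here is a rough sketch of how the call and its binary response might fit together. It assumes the `docraptor` package's documented pattern of setting the API key as `configuration.username` and writing the return value of `create_doc` straight to a file. `build_doc_options` and `write_pdf` are hypothetical names, not functions from this PR.

```python
import os


def build_doc_options(url, test=True):
    """Options passed to DocRaptor; mirrors the dict in the diff."""
    return {
        "test": test,  # test documents are free but watermarked
        "document_type": "pdf",
        "document_url": url,  # DocRaptor fetches and renders this page itself
        "prince_options": {"media": "print"},  # apply @media print styles
    }


def write_pdf(url, output_filename, test=True):
    """Request a PDF for url and save the binary response (needs network)."""
    import docraptor  # third-party: pip install docraptor

    doc_api = docraptor.DocApi()
    # assumption: the API key is supplied as the basic auth username
    doc_api.api_client.configuration.username = os.environ["DOCRAPTOR_API_KEY"]
    response = doc_api.create_doc(build_doc_options(url, test=test))
    # the response is the raw PDF content, so write it in binary mode
    with open(output_filename, "wb") as pdf_file:
        pdf_file.write(response)
```

Keeping the options in their own function makes it easy to check what will be sent without making a request.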

quadrismegistus · 2023-09-20T13:38:34Z

scripts/create_pdf.py

+    # find the links for the articles
+    # limit to links under /issues/ (no author links, DOIs, etc)
+    article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
+    # snippets are under features; get a snippet link list to exclude them
+    snippets = issue.css.select('.snippets a[href^="/issues/"]')


Nice -- did we put id="features" in the HTML for this or similar purposes?

rlskoeser

Oh, I guess my comment is a little unclear! All the articles are under the features section; it's already structured that way in the HTML. This was the simplest way I could find to get the articles and exclude other links.
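The selection and exclusion described in this thread might look roughly like the sketch below. The two selectors are taken from the diff; how the script actually removes snippet links is not shown in the excerpt, so `exclude_snippets` and `feature_article_hrefs` are hypothetical helpers, and the de-duplication step is an assumption (an article could be linked more than once on the issue page). Note that `.css.select` requires a recent BeautifulSoup release (4.12 or later).

```python
def exclude_snippets(article_hrefs, snippet_hrefs):
    """Drop snippet links and repeated links, preserving page order."""
    excluded = set(snippet_hrefs)
    seen = set()
    articles = []
    for href in article_hrefs:
        if href in excluded or href in seen:
            continue
        seen.add(href)
        articles.append(href)
    return articles


def feature_article_hrefs(issue_html):
    """Return article hrefs from the features section of an issue page."""
    from bs4 import BeautifulSoup  # third-party: pip install bs4

    issue = BeautifulSoup(issue_html, "html.parser")
    # links under /issues/ only (no author links, DOIs, etc.)
    article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
    # snippets sit inside the features section, so collect them to exclude
    snippets = issue.css.select('.snippets a[href^="/issues/"]')
    return exclude_snippets(
        [link["href"] for link in article_links],
        [link["href"] for link in snippets],
    )
```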

quadrismegistus · 2023-09-20T13:39:15Z

scripts/create_pdf.py

+            link["href"].strip("/").split("/")[1:]
+        )
+        print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
+        create_pdf(article_url, output_filename, test=test)


Nice, so this just calls the other function for each article
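As a worked example of the loop under discussion, the sketch below derives an absolute article URL and an output filename from each relative link. The filename pattern is an assumption inferred from the example in the script's docstring (`startwords-3-mapping-latent-spaces.pdf`); the exact format string is not visible in the diff excerpt, and `pdf_filename` and `article_jobs` are hypothetical names.

```python
from urllib.parse import urljoin


def pdf_filename(href):
    """Build a PDF filename from a relative article link."""
    # "/issues/3/mapping-latent-spaces/" -> ["3", "mapping-latent-spaces"]
    parts = href.strip("/").split("/")[1:]
    # assumed naming pattern, based on the docstring example
    return "startwords-%s.pdf" % "-".join(parts)


def article_jobs(issue_url, hrefs):
    """Pair each absolute article URL with its output filename."""
    return [(urljoin(issue_url, href), pdf_filename(href)) for href in hrefs]
```

Each resulting pair could then be passed to `create_pdf(article_url, output_filename, test=test)` as in the diff.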

quadrismegistus · 2023-09-20T13:39:47Z

scripts/create_pdf.py

+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Generate a PDF from a url",
+    )
+    parser.add_argument("url")
+    parser.add_argument("-o", "--output", help="Output filename for PDF")
+    parser.add_argument(
+        "--no-test",
+        action="store_false",  # turn test mode off when this parameter is specified
+        default=True,  # testing is true by default
+        help="Turn off the test flag and generate real PDFs",
+    )
+    parser.add_argument("--issue", action="store_true")
+
+    args = parser.parse_args()


I don't know if you've used click before; it simplifies argument parsing, but it's another dependency and this works fine.
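Staying with the standard library, the negative flag can be expressed a little more directly. This is only a sketch of one option: `dest="test"` is an assumption that is not in the diff (without it, argparse stores the value as `args.no_test`), and the help text for `--issue` is made up for illustration. With `action="store_false"`, argparse already defaults the value to `True`, so no explicit default is needed.

```python
import argparse


def build_parser():
    """Command-line parser mirroring the options in the diff."""
    parser = argparse.ArgumentParser(description="Generate a PDF from a url")
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        dest="test",  # assumption: expose the value as args.test
        action="store_false",  # default is True; the flag switches it off
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument(
        "--issue",
        action="store_true",
        help="Generate PDFs for all feature articles in an issue",  # hypothetical wording
    )
    return parser
```

Called as `build_parser().parse_args()`, this leaves test mode on unless `--no-test` is given.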

quadrismegistus · 2023-09-20T13:40:14Z

themes/startwords/assets/scss/print.scss

@@ -13,32 +13,31 @@
  }

  @top-right {
-    content: element(page-header);
+    content: flow(page-header);


what does flow do?

rlskoeser

It's part of the paged media CSS spec - it takes an element out of the regular document and inserts it into a space for a running header or footer.

Here's the docraptor documentation for it, which is a little more readable & targeted than the full paged media spec https://docraptor.com/documentation/article/1067094-headers-footers

I'm still not sure why I had to switch from element to flow but this is what they have in their documentation and it worked the way I wanted. (We use this to put the SW logo in the top right corner, DOI in the bottom left, etc.)

rlskoeser · 2023-09-20T13:48:59Z

Since Grant has reviewed and signed off on the PDFs generated by this new code and now that Ryan has reviewed the python code, I'm going to get this merged into develop so we can start working with the new print styles and PDF generation as we work on finishing the remaining figures.

Explicitly set opacity for marginal print content

1b53af6

rlskoeser marked this pull request as draft September 11, 2023 17:00

gwijthoff added this to the Issue 4 milestone Sep 12, 2023

rlskoeser added 15 commits September 14, 2023 10:48

Restore non-variable font implementation for Source Sans Pro

29f0c88

Adjust how print-only content is hidden

5e52dfc

Use flow instead of running position for marginal content

d3b69f5

Fix additional pdf caption note for deepzoom

8efee59

Test not hiding print-only content for docraptor pdf

2f3063e

Adjust header/footer page flow elements

047b115

Fix page number counter; unhide doi used for pdf footer

974d735

Revise wording for deepzoom viewer pdf note

7a5ec85

Only display print-only assets for docraptor

ad7e91c

Merge branch 'develop' into print-styles

6757e78

Add and configure Noto Emoji for rendering print/PDF emoji

413685b

Merge branch 'develop' into print-styles

ffb8232

Restore 🦜 in stochastic parrots paper title

a159b14

Script to create PDFs with DocRaptor API

0e0b09e

Add an option to create all PDFs for one issue

0ad8a92

rlskoeser changed the title ~~Update print styles for docraptor (WIP)~~ Update print styles for docraptor and add python script for generating PDFs Sep 19, 2023

rlskoeser marked this pull request as ready for review September 19, 2023 21:10

rlskoeser added 2 commits September 19, 2023 17:12

Merge branch 'develop' into print-styles

04e5361

Clean up print styles

e7724f5

rlskoeser requested a review from quadrismegistus September 19, 2023 21:18

Update lighthouse url for renamed article

c80a19e

rlskoeser requested a review from gwijthoff September 19, 2023 21:21

quadrismegistus approved these changes Sep 20, 2023

rlskoeser merged commit 927e9e4 into develop Sep 20, 2023
2 checks passed

rlskoeser deleted the print-styles branch September 20, 2023 13:49
