Update print styles for docraptor and add python script for generating PDFs #339
New file: Python script for generating PDFs

@@ -0,0 +1,123 @@
#!/usr/bin/env python
"""
Python script to create PDFs for Startwords articles using the DocRaptor API.

To use:

- install python dependencies: `pip install docraptor bs4 requests`
- create a DocRaptor account and get an API key: https://docraptor.com
- set your API key as environment variable DOCRAPTOR_API_KEY
- run the script with a Startwords article url and an output filename, e.g.
  `create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o
  startwords-3-mapping-latent-spaces.pdf`

The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
a test PDF is the default behavior for this script. When you are ready to
create a final PDF, use the `--no-test` flag to turn test mode off.

To generate PDFs for all the feature articles in one Startwords issue,
use the `--issue` flag and run with the issue url, e.g.,
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/ --issue`

PDFs will be named according to current Startwords PDF naming conventions.
"""
import argparse
import os
from urllib.parse import urlparse

import docraptor
import requests
from bs4 import BeautifulSoup


DOCRAPTOR_API_KEY = os.environ.get("DOCRAPTOR_API_KEY", None)


def create_pdf(url, output="output.pdf", test=True):
    doc_api = docraptor.DocApi()
    doc_api.api_client.configuration.username = DOCRAPTOR_API_KEY

    output_filename = output or "output.pdf"

    try:
        response = doc_api.create_doc(
            {
                # test documents are free but watermarked
                "test": test,
                "document_type": "pdf",
                "document_url": url,
                # "javascript": True,
                "prince_options": {
                    "media": "print",  # use @media 'print' CSS rather than 'screen'
                },
            }
        )
Comment on lines +44 to +55:

So this just returns the raw binary for the output PDF? Nice

It also reads in the URL directly. Is it investigating the DOM in some way? Looking for paragraph tags etc?

It's using the generated html + the print styles and then somehow applying them to create a paginated pdf that uses the paged media css spec. There's an option to turn on javascript if you need it for rendering; I don't think we need it for any of our SW PDFs, so keeping that disabled for now.
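Since create_doc() hands back the raw PDF bytes, a quick sanity check is possible before writing the file. This is only an illustrative sketch, not part of the PR, and it assumes the response is bytes-like (which the bytearray() call below already relies on):

# illustrative only: a well-formed response should start with the PDF magic number
binary_response = bytearray(response)
if not binary_response.startswith(b"%PDF"):
    print("Warning: response does not look like a PDF")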
        # create_doc() returns a binary string
        with open(output_filename, "w+b") as f:
            binary_formatted_response = bytearray(response)
            f.write(binary_formatted_response)
        print(f"Successfully created PDF: { output_filename }")
    except docraptor.rest.ApiException as error:
        print(error.status)
        print(error.reason)
        print(error.body)


def create_issue_pdfs(issue_url, test=True):
    response = requests.get(issue_url)
    if not response.status_code == requests.codes.ok:
        response.raise_for_status()

    # parse the html and find the links to feature articles
    issue = BeautifulSoup(response.content, "html.parser")
    # find the links for the articles;
    # limit to links under /issues/ (no author links, DOIs, etc)
    article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
    # snippets are under features; get a snippet link list to exclude them
    snippets = issue.css.select('.snippets a[href^="/issues/"]')
Comment on lines +76 to +80:

Nice -- we put an id="features" in the HTML there for this or similar purposes?

oh, I guess my comment is a little unclear! all the articles are under the features section, it's already structured that way in the html; this was the simplest way I could find to get the articles and exclude other links
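To make that selection logic concrete, here is a minimal standalone sketch against hypothetical markup; the hrefs and text are invented for illustration and only mirror the structure the script assumes (a #features section containing article links, with snippets nested under a .snippets element):

from bs4 import BeautifulSoup

# hypothetical issue page markup, for illustration only
html = """
<section id="features">
  <a href="/issues/3/mapping-latent-spaces/">Feature article</a>
  <a href="/authors/some-author/">Author page</a>
  <div class="snippets">
    <a href="/issues/3/some-snippet/">Snippet</a>
  </div>
</section>
"""

issue = BeautifulSoup(html, "html.parser")
article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
snippets = issue.css.select('.snippets a[href^="/issues/"]')

for link in article_links:
    if link in snippets:
        continue  # exclude snippet links, keep feature articles
    print(link["href"])  # prints only /issues/3/mapping-latent-spaces/

(The .css property on BeautifulSoup objects used here and in the script requires bs4 4.12 or newer.)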
    # article links are relative; use issue url to create absolute urls
    parsed_issue_url = urlparse(issue_url)
    for link in article_links:
        # skip snippets for now (make configurable via parameter when needed)
        if link in snippets:
            continue
        article_parsed_url = parsed_issue_url._replace(path=link["href"])
        article_url = article_parsed_url.geturl()
        # filenames are similar to urls;
        # convert /issues/N/title-slug/ to startwords-N-title-slug.pdf
        output_filename = "startwords-%s.pdf" % "-".join(
            link["href"].strip("/").split("/")[1:]
        )
        print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
        create_pdf(article_url, output_filename, test=test)
Comment:

Nice, so this just calls the other function for each article
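For reference, the filename convention falls straight out of the article href; a tiny illustration (not part of the script):

href = "/issues/3/mapping-latent-spaces/"
print("startwords-%s.pdf" % "-".join(href.strip("/").split("/")[1:]))
# -> startwords-3-mapping-latent-spaces.pdf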


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate a PDF from a url",
    )
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        action="store_false",  # turn test mode off when this parameter is specified
        default=True,  # testing is on by default
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument("--issue", action="store_true")

    args = parser.parse_args()
Comment on lines +99 to +113:

I don't know if you've used […]
    if DOCRAPTOR_API_KEY is None:
        print("DocRaptor API key must be set in environment as DOCRAPTOR_API_KEY")
        exit(-1)

    # creating a single article or all articles for one issue?
    if args.issue:
        create_issue_pdfs(args.url, test=args.no_test)
    else:
        create_pdf(args.url, args.output, test=args.no_test)
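One detail worth noting about the CLI: because `--no-test` is declared with action="store_false", the parsed value args.no_test defaults to True and flips to False when the flag is given, and that value is what gets passed as test= to the functions above. A small standalone illustration:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no-test", action="store_false", default=True)

print(parser.parse_args([]).no_test)             # True  -> watermarked test PDF
print(parser.parse_args(["--no-test"]).no_test)  # False -> real PDF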
Changes to the print stylesheet (SCSS):

@@ -13,32 +13,31 @@
  }

  @top-right {
-   content: element(page-header);
+   content: flow(page-header);
Comment on this change:

what does […]

It's part of the paged media CSS spec - it takes an element out of the regular document and inserts it into a space for a running header or footer. Here's the docraptor documentation for it, which is a little more readable & targeted than the full paged media spec: https://docraptor.com/documentation/article/1067094-headers-footers

I'm still not sure why I had to switch from […]
    opacity: 1;
  }

  @bottom-left {
-   content: element(page-footer);
+   content: flow(page-footer);
    opacity: 1;
    font-size: 11px;
    font-weight: 300;
  }

  @bottom-right {
    content: counter(page);
    font-size: 11px;
    font-weight: 300;
  }
}

-@page :first {
+@page:first {
  /* don't display article title on first page */
  @top-left {
    content: '';
  }
  @top-right {
-   content: element(first-page-header);
+   content: flow(first-page-header);
  }
}
@@ -48,7 +47,9 @@ html, body, main {

body {
- font-family: $font-serif;
+ /* use black & white Noto Emoji as fallback for emoji display in print
+    and in PDFs generated by DocRaptor */
+ font-family: $font-serif, "Noto Emoji";
  font-size: 12px;
  line-height: 18px;
@@ -63,7 +64,7 @@ body {
  margin-bottom: 0.5rem;

  .number, .theme, .authors, .doi, .tags, time {
-   font-family: $font-sans;
+   font-family: $font-sans, "Noto Emoji";
    font-size: 14px;
  }
@@ -95,7 +96,6 @@ body {
}

blockquote.pull {
- /* width: 50%; */
  width: 66%; /* wider makes it easier to flow content */
  margin-top: 0;
  line-height: 22px;
@@ -166,28 +166,37 @@ main {
    font-weight: 500;
    margin: 16px 0 6px;
  }
}

.doi {
  string-set: doi content();
}
}

-.print-only {
-  display: block;
-  /* hide for pinting in browsers that don't support @page styles */
-  opacity: 0;
+/* only display print-only marginal content for docraptor */
+@media (update:none) and (pointer:none) and (hover:none) {
+  .print-only {
+    display: block;
+  }
}
a.page-header {
- position: running(page-header);
+ flow: static(page-header);
+ // position: running(page-header);
  width: 100%;
  text-align: right;
  display: block;
}
a.first-page-header {
  position: running(first-page-header);
}
a.article-title {
- position: running(header-title);
+ flow: static(first-page-header);
  width: 100%;
  text-align: right;
  display: block;
+ // position: running(first-page-header);
}
a.page-doi {
- position: running(page-footer);
+ flow: static(page-footer);
+ // position: running(page-footer);
}
@@ -209,6 +218,7 @@ body.article .footnotes {
  font-size: 12px;
  line-height: 18px;
  position: relative;
+ font-family: $font-serif, "Noto Emoji";

  /** NOTE: counters are not incrementing when generating PDFs with
      pagedjs (all endnotes are numbered 0). As a workaround,
@@ -247,6 +257,11 @@ body > header, nav, footer, a[rel=alternate] {
  display: none;
}

+ // unhide doi used for pdf footers
+ .print-only a.page-doi[rel=alternate] {
+   display: inline-block;
+ }

a.footnote-ref {
  padding: 0;
Comment:

Nice, very clear documentation, and it's cool it accepts (already legible) URLs as input