
Update print styles for docraptor and add python script for generating PDFs #339

Merged 19 commits on Sep 20, 2023
2 changes: 1 addition & 1 deletion content/issues/4/sonorous-medieval/index.md
@@ -451,7 +451,7 @@ In using historical secondary sources for training an NLP model, we have found t

Our project continues the practice of reading classical texts as data, but with a different aim than previous iterations of this practice: our goal is to produce a machine learning algorithm. Some of the outlined steps that we have used to parse the *Jingdian Shiwen* are still works in progress, and additionally, some of the most difficult work remains to be approached. Constructing a statistical model that can represent all the complexities of current hypotheses regarding Old Chinese syllables will strain the limits of contemporary NLP platforms, most of which have no concept whatsoever of phonology. Our aim here is to persuade our peers that this task, and others like it in other languages, is not just possible but wholly worthwhile for researchers in the humanities. For this reason, we believe that further collaboration --- with digital humanists, philologists, and others interested in expanding the debates around ancient texts to incorporate sound --- is one of the most generative approaches to making use of NLP frameworks in the study of ancient texts.

[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).
[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).

[^2]: This phrase is referring to the idea that in order to get a sense of the meaning of a target word, one need only carefully to select from among its surrounding context words; it draws from the title of Ashish Vaswani et al., "Attention Is All You Need," *NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems* (Red Hook: Curran Associates, 2017): 6000--10, [https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).  "Attention" refers specifically to an algorithmic method of calculating the importance of context words relative to the target.

2 changes: 1 addition & 1 deletion lighthouserc.js
@@ -15,7 +15,7 @@ module.exports = {
"/issues/1/",
"/issues/1/data-beyond-vision/",
"/issues/4/",
"/issues/4/unnatural-language/",
"/issues/4/sonorous-medieval/",
"/issues/4/toward-deep-map/",
"/authors/",
"/404.html"
123 changes: 123 additions & 0 deletions scripts/create_pdf.py
@@ -0,0 +1,123 @@
#!/usr/bin/env python
"""
Python script to create PDFs for Startwords articles using the DocRaptor API.

To use:

- install python dependencies: `pip install docraptor bs4 requests`
- create a DocRaptor account and get an API key https://docraptor.com
- set your API key as environment variable DOCRAPTOR_API_KEY
- run the script with a Startwords article url and an output filename, e.g.
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o
startwords-3-mapping-latent-spaces.pdf`

The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
a test PDF is the default behavior for this script. When you are ready to
create a final PDF, use the `--no-test` flag to turn test mode off.

To generate PDFs for all the feature articles in one Startwords issue,
use the `--issue` flag and run with the issue url, e.g.,
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ --issue`

PDFs will be named according to current Startwords PDF naming conventions.

"""
Comment on lines +5 to +24:

Nice, very clear documentation, and it's cool it accepts (already legible) URLs as input

import argparse
import os
from urllib.parse import urlparse

import docraptor
import requests
from bs4 import BeautifulSoup


DOCRAPTOR_API_KEY = os.environ.get("DOCRAPTOR_API_KEY", None)


def create_pdf(url, output="output.pdf", test=True):
doc_api = docraptor.DocApi()
doc_api.api_client.configuration.username = DOCRAPTOR_API_KEY

output_filename = output or "output.pdf"

try:
response = doc_api.create_doc(
{
# test documents are free but watermarked
"test": test,
"document_type": "pdf",
"document_url": url,
# "javascript": True,
"prince_options": {
"media": "print", # @media 'screen' or 'print' CSS
},
}
)
Comment on lines +44 to +55:

So this just returns the raw binary for the output PDF? Nice

It also reads in the URL directly. Is it investigating the DOM in some way? Looking for paragraph tags etc?

Contributor Author:

It's using the generated html + the print styles and then somehow applying them to create a paginated pdf that uses the paged media css spec.

There's an option to turn on javascript if you need it for rendering; I don't think we need it for any of our SW PDFs so keeping that disabled for now.


# create_doc() returns a binary string
with open(output_filename, "w+b") as f:
binary_formatted_response = bytearray(response)
f.write(binary_formatted_response)
print(f"Successfully created PDF: { output_filename }")
except docraptor.rest.ApiException as error:
print(error.status)
print(error.reason)
print(error.body)


def create_issue_pdfs(issue_url, test=True):
response = requests.get(issue_url)
if not response.status_code == requests.codes.ok:
response.raise_for_status()

# parse the html and find the links to feature articles
issue = BeautifulSoup(response.content, "html.parser")
# find the links for the articles
# limit to links under /issues/ (no author links, DOIs, etc)
article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
# snippets are under features; get a snippet link list to exclude them
snippets = issue.css.select('.snippets a[href^="/issues/"]')
Comment on lines +76 to +80:

Nice -- we put an id="features" in the HTML there for this or similar purposes?

Contributor Author:

oh, I guess my comment is a little unclear! all the articles are under the features section, it's already structured that way in the html; this was the simplest way I could find to get the articles and exclude other links


# article links are relative; use issue url to create absolute urls
parsed_issue_url = urlparse(issue_url)
for link in article_links:
# skip snippets for now (make configurable via parameter when needed)
if link in snippets:
continue
article_parsed_url = parsed_issue_url._replace(path=link["href"])
article_url = article_parsed_url.geturl()
# filenames are similar to urls;
# convert /issues/N/title-slug/ to startwords-N-title-slug.pdf
output_filename = "startwords-%s.pdf" % "-".join(
link["href"].strip("/").split("/")[1:]
)
print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
create_pdf(article_url, output_filename, test=test)

Nice, so this just calls the other function for each article
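The URL handling and filename convention used in `create_issue_pdfs` can be seen in isolation; this is a standalone sketch (not part of the PR) mirroring the same `urlparse._replace` and path-splitting logic:

```python
from urllib.parse import urlparse


def article_pdf_name(href):
    """Convert an article path like /issues/N/title-slug/ to the
    startwords-N-title-slug.pdf naming convention."""
    return "startwords-%s.pdf" % "-".join(href.strip("/").split("/")[1:])


def absolutize(issue_url, href):
    """Resolve a relative article link against the issue url by
    swapping in the link's path."""
    return urlparse(issue_url)._replace(path=href).geturl()


print(article_pdf_name("/issues/3/mapping-latent-spaces/"))
# startwords-3-mapping-latent-spaces.pdf
print(absolutize("https://startwords.cdh.princeton.edu/issues/3/",
                 "/issues/3/mapping-latent-spaces/"))
```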



if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Generate a PDF from a url",
)
parser.add_argument("url")
parser.add_argument("-o", "--output", help="Output filename for PDF")
parser.add_argument(
"--no-test",
action="store_false", # turn test mode off when this parameter is specified
default=True, # testing is true by default
help="Turn off the test flag and generate real PDFs",
)
parser.add_argument("--issue", action="store_true")

args = parser.parse_args()
Comment on lines +99 to +113:

I don't know if you've used click before, it simplifies argparsing, but it's another dependency and this works fine
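Whichever library is used, the flag semantics are worth pinning down; a minimal standalone check (not part of the PR) of the `store_false` behavior used here — note the default should be the boolean `True`, not the string `"true"`:

```python
import argparse

# minimal sketch of the script's test-mode flag, isolated for illustration
parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-test",
    action="store_false",  # specifying the flag turns test mode off
    default=True,          # test mode is on by default
)

default_args = parser.parse_args([])
real_args = parser.parse_args(["--no-test"])
print(default_args.no_test)  # True: generate watermarked test PDFs
print(real_args.no_test)     # False: generate real PDFs
```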


if DOCRAPTOR_API_KEY is None:
print("DocRaptor API key must be set in environment as DOCRAPTOR_API_KEY")
exit(-1)

# creating a single article or all articles for one issue?
if args.issue:
create_issue_pdfs(args.url, test=args.no_test)
else:
create_pdf(args.url, args.output, test=args.no_test)
9 changes: 9 additions & 0 deletions themes/startwords/assets/scss/_fonts.scss
@@ -195,3 +195,12 @@
// with the main body text, which is currently font-weight 300
font-weight: 300;
}

/* Noto Emoji (black & white version, for print) */
@font-face {
font-family: "Noto Emoji";
font-display: swap;
src: url('/fonts/NotoEmoji/NotoEmoji-Regular.woff2') format('woff2'),
url('/fonts/NotoEmoji/NotoEmoji-Regular.woff') format('woff');
unicode-range: U+1F600-1F64F, U+1F900-1F9FF; /* unicode emoji range */
}
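As a quick standalone sanity check (not part of the PR), the declared `unicode-range` covers both the parrot emoji restored in the footnote diff above and the common emoticon block:

```python
# codepoint ranges declared in the @font-face unicode-range:
# U+1F600-1F64F (emoticons), U+1F900-1F9FF (supplemental symbols)
RANGES = [(0x1F600, 0x1F64F), (0x1F900, 0x1F9FF)]


def in_declared_range(char):
    """Return True if the character falls in one of the declared ranges."""
    return any(lo <= ord(char) <= hi for lo, hi in RANGES)


print(in_declared_range("🦜"))  # parrot, U+1F99C -> True
print(in_declared_range("😀"))  # grinning face, U+1F600 -> True
```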
55 changes: 35 additions & 20 deletions themes/startwords/assets/scss/print.scss
@@ -13,32 +13,31 @@
}

@top-right {
content: element(page-header);
content: flow(page-header);

what does flow do?

Contributor Author:

It's part of the paged media CSS spec - it takes an element out of the regular document and inserts it into a space for a running header or footer.

Here's the docraptor documentation for it, which is a little more readable & targeted than the full paged media spec https://docraptor.com/documentation/article/1067094-headers-footers

I'm still not sure why I had to switch from element to flow but this is what they have in their documentation and it worked the way I wanted. (We use this to put the SW logo in the top right corner, DOI in the bottom left, etc.)

opacity: 1;
}

@bottom-left {
content: element(page-footer);
content: flow(page-footer);
opacity: 1;
font-size: 11px;
font-weight: 300;
}


@bottom-right {
content: counter(page);
font-size: 11px;
font-weight: 300;
}
}

@page :first {
@page:first {
/* don't display article title on first page */
@top-left {
content: '';
}
@top-right {
content: element(first-page-header);
content: flow(first-page-header);
}
}

@@ -48,7 +47,9 @@ html, body, main {


body {
font-family: $font-serif;
/* use black & white Noto Emoji as fallback for emoji display in print
and in PDFs generated by DocRaptor */
font-family: $font-serif, "Noto Emoji";
font-size: 12px;
line-height: 18px;

@@ -63,7 +64,7 @@ body {
margin-bottom: 0.5rem;

.number, .theme, .authors, .doi, .tags, time {
font-family: $font-sans;
font-family: $font-sans, "Noto Emoji";
font-size: 14px;
}

@@ -95,7 +96,6 @@
}

blockquote.pull {
/* width: 50%; */
width: 66%; /* wider makes it easier to flow content */
margin-top: 0;
line-height: 22px;
@@ -166,28 +166,37 @@ main {
font-weight: 500;
margin: 16px 0 6px;
}
}

.doi {
string-set: doi content();
}
}

.print-only {
display: block;
/* hide for printing in browsers that don't support @page styles */
opacity: 0;
/* only display print-only marginal content for docraptor */
@media (update:none) and (pointer:none) and (hover:none) {
.print-only {
display: block;
}
}

a.page-header {
position: running(page-header);
flow: static(page-header);
// position: running(page-header);
width: 100%;
text-align: right;
display: block;
}
a.first-page-header {
position: running(first-page-header);
}

a.article-title {
position: running(header-title);
flow: static(first-page-header);
width: 100%;
text-align: right;
display: block;
// position: running(first-page-header);
}

a.page-doi {
position: running(page-footer);
flow: static(page-footer);
// position: running(page-footer);
}


@@ -209,6 +218,7 @@ body.article .footnotes {
font-size: 12px;
line-height: 18px;
position: relative;
font-family: $font-serif, "Noto Emoji";

/** NOTE: counters are not incrementing when generating PDFs with
pagedjs (all endnotes are numbered 0). As a workaround,
@@ -247,6 +257,11 @@
display: none;
}

// unhide doi used for pdf footers
.print-only a.page-doi[rel=alternate] {
display: inline-block;
}


a.footnote-ref {
padding: 0;
4 changes: 2 additions & 2 deletions themes/startwords/layouts/shortcodes/deepzoom.html
@@ -27,8 +27,8 @@
<img src="{{ .Get `pdf-img` }}" alt="{{ .Get `pdf-alt` }}" loading="lazy"/>
<figcaption>
{{ with .Get `caption` }}
<p>{{ . | markdownify }}</p></figcaption>
<p>{{ . | markdownify }}</p>
{{ end }}
<p>The online version of this essay includes an interactive deep zoom viewer displaying a high resolution capture of this object.</p></figcaption>
<p><small>The online version of this essay includes an interactive deep zoom viewer displaying a high resolution version of this image.</small></p></figcaption>
</figure>
{{ end }}
7 binary files not shown.