Commit

Merge pull request #339 from Princeton-CDH/print-styles
Update print styles for docraptor and add python script for generating PDFs
rlskoeser authored Sep 20, 2023
2 parents a0fd245 + c80a19e commit 927e9e4
Showing 13 changed files with 171 additions and 24 deletions.
2 changes: 1 addition & 1 deletion content/issues/4/sonorous-medieval/index.md
@@ -451,7 +451,7 @@ In using historical secondary sources for training an NLP model, we have found t

Our project continues the practice of reading classical texts as data, but with a different aim than previous iterations of this practice: our goal is to produce a machine learning algorithm. Some of the outlined steps that we have used to parse the *Jingdian Shiwen* are still works in progress, and additionally, some of the most difficult work remains to be approached. Constructing a statistical model that can represent all the complexities of current hypotheses regarding Old Chinese syllables will strain the limits of contemporary NLP platforms, most of which have no concept whatsoever of phonology. Our aim here is to persuade our peers that this task, and others like it in other languages, is not just possible but wholly worthwhile for researchers in the humanities. For this reason, we believe that further collaboration --- with digital humanists, philologists, and others interested in expanding the debates around ancient texts to incorporate sound --- is one of the most generative approaches to making use of NLP frameworks in the study of ancient texts.

[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).
[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).

[^2]: This phrase refers to the idea that, to get a sense of the meaning of a target word, one need only select carefully from among its surrounding context words; it draws from the title of Ashish Vaswani et al., "Attention Is All You Need," *NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems* (Red Hook: Curran Associates, 2017): 6000--10, [https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). "Attention" refers specifically to an algorithmic method of calculating the importance of context words relative to the target.

2 changes: 1 addition & 1 deletion lighthouserc.js
@@ -15,7 +15,7 @@ module.exports = {
"/issues/1/",
"/issues/1/data-beyond-vision/",
"/issues/4/",
"/issues/4/unnatural-language/",
"/issues/4/sonorous-medieval/",
"/issues/4/toward-deep-map/",
"/authors/",
"/404.html"
123 changes: 123 additions & 0 deletions scripts/create_pdf.py
@@ -0,0 +1,123 @@
#!/usr/bin/env python
"""
Python script to create PDFs for Startwords articles using the DocRaptor API.

To use:
- install python dependencies: `pip install docraptor bs4 requests`
- create a DocRaptor account and get an API key https://docraptor.com
- set your API key as environment variable DOCRAPTOR_API_KEY
- run the script with a Startwords article url and an output filename, e.g.
  `create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o startwords-3-mapping-latent-spaces.pdf`

The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
a test PDF is the default behavior for this script. When you are ready to
create a final PDF, use the `--no-test` flag to turn test mode off.

To generate PDFs for all the feature articles in one Startwords issue,
use the `--issue` flag and run with the issue url, e.g.,
`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/ --issue`

PDFs will be named according to current Startwords PDF naming conventions.
"""
import argparse
import os
from urllib.parse import urlparse

import docraptor
import requests
from bs4 import BeautifulSoup


DOCRAPTOR_API_KEY = os.environ.get("DOCRAPTOR_API_KEY", None)


def create_pdf(url, output="output.pdf", test=True):
    """Generate a PDF of the page at `url` with DocRaptor and save it to `output`."""
    doc_api = docraptor.DocApi()
    doc_api.api_client.configuration.username = DOCRAPTOR_API_KEY

    output_filename = output or "output.pdf"

    try:
        response = doc_api.create_doc(
            {
                # test documents are free but watermarked
                "test": test,
                "document_type": "pdf",
                "document_url": url,
                # "javascript": True,
                "prince_options": {
                    "media": "print",  # @media 'screen' or 'print' CSS
                },
            }
        )

        # create_doc() returns a binary string
        # (the with block closes the file; no explicit close() is needed)
        with open(output_filename, "w+b") as f:
            binary_formatted_response = bytearray(response)
            f.write(binary_formatted_response)
        print(f"Successfully created PDF: { output_filename }")
    except docraptor.rest.ApiException as error:
        print(error.status)
        print(error.reason)
        print(error.body)


def create_issue_pdfs(issue_url, test=True):
    """Create a PDF for each feature article linked from an issue page."""
    response = requests.get(issue_url)
    # raise an exception if the request returned an HTTP error status
    response.raise_for_status()

    # parse the HTML and find the links to feature articles
    issue = BeautifulSoup(response.content, "html.parser")
    # limit to links under /issues/ (no author links, DOIs, etc)
    article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
    # snippets are under features; get a snippet link list to exclude them
    snippets = issue.css.select('.snippets a[href^="/issues/"]')

    # article links are relative; use issue url to create absolute urls
    parsed_issue_url = urlparse(issue_url)
    for link in article_links:
        # skip snippets for now (make configurable via parameter when needed)
        if link in snippets:
            continue
        article_parsed_url = parsed_issue_url._replace(path=link["href"])
        article_url = article_parsed_url.geturl()
        # filenames are similar to urls;
        # convert /issues/N/title-slug/ to startwords-N-title-slug.pdf
        output_filename = "startwords-%s.pdf" % "-".join(
            link["href"].strip("/").split("/")[1:]
        )
        print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
        create_pdf(article_url, output_filename, test=test)
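
Note on the URL and filename handling in `create_issue_pdfs`: the step can be tried in isolation without any network access. The following is a minimal sketch using only the standard library; the helper name `article_url_and_filename` is hypothetical and is not part of the script.

```python
from urllib.parse import urlparse


def article_url_and_filename(issue_url, href):
    """Hypothetical helper mirroring the loop body of create_issue_pdfs."""
    # article links are site-relative, so swap the path on the issue url
    article_url = urlparse(issue_url)._replace(path=href).geturl()
    # /issues/N/title-slug/ -> startwords-N-title-slug.pdf
    filename = "startwords-%s.pdf" % "-".join(href.strip("/").split("/")[1:])
    return article_url, filename


url, name = article_url_and_filename(
    "https://startwords.cdh.princeton.edu/issues/3/",
    "/issues/3/mapping-latent-spaces/",
)
print(url)   # absolute article url
print(name)  # startwords-3-mapping-latent-spaces.pdf
```

The resulting filename matches the one used in the usage example in the script's docstring.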


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate a PDF from a url",
    )
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        action="store_false",  # turn test mode off when this parameter is specified
        default=True,  # testing is true by default
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument(
        "--issue",
        action="store_true",
        help="Generate PDFs for all feature articles in an issue",
    )

    args = parser.parse_args()

    if DOCRAPTOR_API_KEY is None:
        print("DocRaptor API key must be set in environment as DOCRAPTOR_API_KEY")
        raise SystemExit(1)

    # args.no_test holds the test-mode value: True unless --no-test was given
    # creating a single article or all articles for one issue?
    if args.issue:
        create_issue_pdfs(args.url, test=args.no_test)
    else:
        create_pdf(args.url, args.output, test=args.no_test)
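
Note on the `--no-test` option: because it uses `store_false`, the parsed value lands in `args.no_test` and is true when the flag is absent, which reads backwards at the call sites. One possible alternative, sketched here as an assumption rather than the script's actual interface, is to give the option an explicit `dest="test"`. The `build_parser` name and the example.com url are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical variant of the script's parser with an explicit dest."""
    parser = argparse.ArgumentParser(description="Generate a PDF from a url")
    parser.add_argument("url")
    parser.add_argument("-o", "--output", help="Output filename for PDF")
    parser.add_argument(
        "--no-test",
        dest="test",           # store the value as args.test
        action="store_false",  # passing --no-test sets args.test to False
        default=True,          # test mode stays on unless the flag is given
        help="Turn off the test flag and generate real PDFs",
    )
    parser.add_argument("--issue", action="store_true")
    return parser


default_args = build_parser().parse_args(["https://example.com/article/"])
final_args = build_parser().parse_args(["https://example.com/article/", "--no-test"])
print(default_args.test, final_args.test)  # True False
```

With this variant the calls would read `test=args.test`, so the keyword and the attribute carry the same meaning.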
9 changes: 9 additions & 0 deletions themes/startwords/assets/scss/_fonts.scss
@@ -195,3 +195,12 @@
// with the main body text, which is currently font-weight 300
font-weight: 300;
}

/* Noto Emoji (black & white version, for print) */
@font-face {
font-family: "Noto Emoji";
font-display: swap;
src: url('/fonts/NotoEmoji/NotoEmoji-Regular.woff2') format('woff2'),
url('/fonts/NotoEmoji/NotoEmoji-Regular.woff') format('woff');
unicode-range: U+1F600-1F64F, U+1F900-1F9FF; /* unicode emoji range */
}
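
Note on `unicode-range`: the descriptor limits the font to the listed code points, so it is worth checking that the emoji used in the article footnote falls inside one of the two ranges. The sketch below does that check in Python for the parrot emoji. The names `EMOJI_RANGES` and `in_ranges` are illustrative only, and the ranges are the two intended by the `@font-face` rule above.

```python
# Ranges from the Noto Emoji @font-face rule, as (first, last) code points
EMOJI_RANGES = [(0x1F600, 0x1F64F), (0x1F900, 0x1F9FF)]


def in_ranges(char, ranges=EMOJI_RANGES):
    """Return True if the character's code point is inside any range."""
    code_point = ord(char)
    return any(first <= code_point <= last for first, last in ranges)


parrot = "\U0001F99C"  # the parrot emoji used in the Stochastic Parrots title
print(f"U+{ord(parrot):X}", in_ranges(parrot))  # U+1F99C True
print(in_ranges("a"))  # False: ordinary text is left to the main font
```

Emoji outside these two blocks would not be covered by this font declaration and would fall through to whatever other fonts are available.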
55 changes: 35 additions & 20 deletions themes/startwords/assets/scss/print.scss
@@ -13,32 +13,31 @@
}

@top-right {
content: element(page-header);
content: flow(page-header);
opacity: 1;
}

@bottom-left {
content: element(page-footer);
content: flow(page-footer);
opacity: 1;
font-size: 11px;
font-weight: 300;
}


@bottom-right {
content: counter(page);
font-size: 11px;
font-weight: 300;
}
}

@page :first {
@page:first {
/* don't display article title on first page */
@top-left {
content: '';
}
@top-right {
content: element(first-page-header);
content: flow(first-page-header);
}
}

@@ -48,7 +47,9 @@ html, body, main {


body {
font-family: $font-serif;
/* use black & white Noto Emoji as fallback for emoji display in print
and in PDFs generated by DocRaptor */
font-family: $font-serif, "Noto Emoji";
font-size: 12px;
line-height: 18px;

@@ -63,7 +64,7 @@ body {
margin-bottom: 0.5rem;

.number, .theme, .authors, .doi, .tags, time {
font-family: $font-sans;
font-family: $font-sans, "Noto Emoji";
font-size: 14px;
}

@@ -95,7 +96,6 @@ body {
}

blockquote.pull {
/* width: 50%; */
width: 66%; /* wider makes it easier to flow content */
margin-top: 0;
line-height: 22px;
@@ -166,28 +166,37 @@ main {
font-weight: 500;
margin: 16px 0 6px;
}
}

.doi {
string-set: doi content();
}
}

.print-only {
display: block;
/* hide for printing in browsers that don't support @page styles */
opacity: 0;
/* only display print-only marginal content for docraptor */
@media (update:none) and (pointer:none) and (hover:none) {
.print-only {
display: block;
}
}

a.page-header {
position: running(page-header);
flow: static(page-header);
// position: running(page-header);
width: 100%;
text-align: right;
display: block;
}
a.first-page-header {
position: running(first-page-header);
}

a.article-title {
position: running(header-title);
flow: static(first-page-header);
width: 100%;
text-align: right;
display: block;
// position: running(first-page-header);
}

a.page-doi {
position: running(page-footer);
flow: static(page-footer);
// position: running(page-footer);
}


@@ -209,6 +218,7 @@ body.article .footnotes {
font-size: 12px;
line-height: 18px;
position: relative;
font-family: $font-serif, "Noto Emoji";

/** NOTE: counters are not incrementing when generating PDFs with
pagedjs (all endnotes are numbered 0). As a workaround,
@@ -247,6 +257,11 @@ body > header, nav, footer, a[rel=alternate] {
display: none;
}

// unhide doi used for pdf footers
.print-only a.page-doi[rel=alternate] {
display: inline-block;
}


a.footnote-ref {
padding: 0;
4 changes: 2 additions & 2 deletions themes/startwords/layouts/shortcodes/deepzoom.html
@@ -27,8 +27,8 @@
<img src="{{ .Get `pdf-img` }}" alt="{{ .Get `pdf-alt` }}" loading="lazy"/>
<figcaption>
{{ with .Get `caption` }}
<p>{{ . | markdownify }}</p></figcaption>
<p>{{ . | markdownify }}</p>
{{ end }}
<p>The online version of this essay includes an interactive deep zoom viewer displaying a high resolution capture of this object.</p></figcaption>
<p><small>The online version of this essay includes an interactive deep zoom viewer displaying a high resolution version of this image.</small></p></figcaption>
</figure>
{{ end }}
7 binary files not shown.
