Merge pull request #339 from Princeton-CDH/print-styles

Update print styles for docraptor and add python script for generating PDFs
Princeton-CDH · Sep 20, 2023 · 927e9e4
2 parents a0fd245 + c80a19e
commit 927e9e4
Showing 13 changed files with 171 additions and 24 deletions.
diff --git a/content/issues/4/sonorous-medieval/index.md b/content/issues/4/sonorous-medieval/index.md
@@ -451,7 +451,7 @@ In using historical secondary sources for training an NLP model, we have found t
 
 Our project continues the practice of reading classical texts as data, but with a different aim than previous iterations of this practice: our goal is to produce a machine learning algorithm. Some of the outlined steps that we have used to parse the *Jingdian Shiwen* are still works in progress, and additionally, some of the most difficult work remains to be approached. Constructing a statistical model that can represent all the complexities of current hypotheses regarding Old Chinese syllables will strain the limits of contemporary NLP platforms, most of which have no concept whatsoever of phonology. Our aim here is to persuade our peers that this task, and others like it in other languages, is not just possible but wholly worthwhile for researchers in the humanities. For this reason, we believe that further collaboration --- with digital humanists, philologists, and others interested in expanding the debates around ancient texts to incorporate sound --- is one of the most generative approaches to making use of NLP frameworks in the study of ancient texts.
 
-[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).
+[^1]: Compare, however, the danger of this tendency, as shown by Emily M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜," *FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency* (New York: Association for Computing Machinery, 2021), 610--23, [https://doi.org/10.1145/3442188.3445922](https://doi.org/10.1145/3442188.3445922).
 
 [^2]: This phrase is referring to the idea that in order to get a sense of the meaning of a target word, one need only carefully to select from among its surrounding context words; it draws from the title of Ashish Vaswani et al., "Attention Is All You Need," *NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems* (Red Hook: Curran Associates, 2017): 6000--10, [https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).  "Attention" refers specifically to an algorithmic method of calculating the importance of context words relative to the target.
 

diff --git a/lighthouserc.js b/lighthouserc.js
@@ -15,7 +15,7 @@ module.exports = {
         "/issues/1/",
         "/issues/1/data-beyond-vision/",
         "/issues/4/",
-        "/issues/4/unnatural-language/",
+        "/issues/4/sonorous-medieval/",
         "/issues/4/toward-deep-map/",
         "/authors/",
         "/404.html"

diff --git a/scripts/create_pdf.py b/scripts/create_pdf.py
@@ -0,0 +1,123 @@
+#!/usr/bin/env python
+"""
+Python script to create PDFs for Startwords articles using the DocRaptor API.
+
+To use:
+
+- install python dependencies: `pip install docraptor bs4 requests`
+- create a DocRaptor account and get an API key https://docraptor.com
+- set your API key as environment variable DOCRAPTOR_API_KEY
+- run the script with a Startwords article url and an output filename, e.g.
+  `create_pdf.py https://startwords.cdh.princeton.edu/issues/3/mapping-latent-spaces/ -o
+  startwords-3-mapping-latent-spaces.pdf`
+
+The DocRaptor API allows unlimited test PDFs, which are watermarked; creating
+a test PDF is the default behavior for this script. When you are ready to
+create a final PDF, use the `--no-test` flag to turn test mode off.
+
+To generate PDFs for all the feature articles in one Startwords issue,
+use the `--issue` flag and run with the issue url, e.g.,
+`create_pdf.py https://startwords.cdh.princeton.edu/issues/3/ --issue`
+
+PDFs will be named according to current Startwords PDF naming conventions.
+
+"""
+import argparse
+import os
+from urllib.parse import urlparse
+
+import docraptor
+import requests
+from bs4 import BeautifulSoup
+
+
+DOCRAPTOR_API_KEY = os.environ.get("DOCRAPTOR_API_KEY", None)
+
+
+def create_pdf(url, output="output.pdf", test=True):
+    doc_api = docraptor.DocApi()
+    doc_api.api_client.configuration.username = DOCRAPTOR_API_KEY
+
+    output_filename = output or "output.pdf"
+
+    try:
+        response = doc_api.create_doc(
+            {
+                # test documents are free but watermarked
+                "test": test,
+                "document_type": "pdf",
+                "document_url": url,
+                # "javascript": True,
+                "prince_options": {
+                    "media": "print",  # @media 'screen' or 'print' CSS
+                },
+            }
+        )
+
+        # create_doc() returns a binary string
+        with open(output_filename, "w+b") as f:
+            binary_formatted_response = bytearray(response)
+            # the with block closes the file, so no explicit close() is needed
+            f.write(binary_formatted_response)
+        print(f"Successfully created PDF: { output_filename }")
+    except docraptor.rest.ApiException as error:
+        print(error.status)
+        print(error.reason)
+        print(error.body)
+
+
+def create_issue_pdfs(issue_url, test=True):
+    response = requests.get(issue_url)
+    if response.status_code != requests.codes.ok:
+        response.raise_for_status()
+
+    # parse the html and find the links to feature articles
+    issue = BeautifulSoup(response.content, "html.parser")
+    # find the links for the articles
+    # limit to links under /issues/ (no author links, DOIs, etc)
+    article_links = issue.find(id="features").css.select('a[href^="/issues/"]')
+    # snippets are under features; get a snippet link list to exclude them
+    snippets = issue.css.select('.snippets a[href^="/issues/"]')
+
+    # article links are relative; use issue url to create absolute urls
+    parsed_issue_url = urlparse(issue_url)
+    for link in article_links:
+        # skip snippets for now (make configurable via parameter when needed)
+        if link in snippets:
+            continue
+        article_parsed_url = parsed_issue_url._replace(path=link["href"])
+        article_url = article_parsed_url.geturl()
+        # filenames are similar to urls;
+        # convert /issues/N/title-slug/ to startwords-N-title-slug.pdf
+        output_filename = "startwords-%s.pdf" % "-".join(
+            link["href"].strip("/").split("/")[1:]
+        )
+        print(f"\nCreating PDF from { article_url }\n\t➡️ { output_filename }")
+        create_pdf(article_url, output_filename, test=test)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Generate a PDF from a url",
+    )
+    parser.add_argument("url")
+    parser.add_argument("-o", "--output", help="Output filename for PDF")
+    parser.add_argument(
+        "--no-test",
+        action="store_false",  # turn test mode off when this parameter is specified
+        default=True,  # testing is true by default
+        help="Turn off the test flag and generate real PDFs",
+    )
+    parser.add_argument("--issue", action="store_true")
+
+    args = parser.parse_args()
+
+    if DOCRAPTOR_API_KEY is None:
+        print("DocRaptor API key must be set in environment as DOCRAPTOR_API_KEY")
+        raise SystemExit(-1)
+
+    # creating a single article or all articles for one issue?
+    if args.issue:
+        create_issue_pdfs(args.url, test=args.no_test)
+    else:
+        create_pdf(args.url, args.output, test=args.no_test)
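Review note on `scripts/create_pdf.py`: the sketch below isolates two pieces of logic from the script so they can be checked without calling DocRaptor. The first helper mirrors how `create_issue_pdfs` turns a relative article link into an absolute URL and a PDF filename. The second shows one possible way to expose `--no-test` as `args.test` through `dest="test"`, which is an alternative to the `args.no_test` attribute the script reads, not what the script currently does. The function names `article_url_and_filename` and `build_parser` are hypothetical and do not appear in this PR.

```python
import argparse
from urllib.parse import urlparse


def article_url_and_filename(issue_url, href):
    """Return (absolute article url, PDF filename) for a relative link.

    Hypothetical helper; it follows the same steps as create_issue_pdfs.
    """
    # keep the scheme and host of the issue url, swap in the article path
    article_url = urlparse(issue_url)._replace(path=href).geturl()
    # convert /issues/N/title-slug/ to startwords-N-title-slug.pdf
    parts = href.strip("/").split("/")[1:]
    return article_url, "startwords-%s.pdf" % "-".join(parts)


def build_parser():
    """Hypothetical parser: --no-test stores False into args.test."""
    parser = argparse.ArgumentParser(description="Generate a PDF from a url")
    parser.add_argument("url")
    parser.add_argument(
        "--no-test",
        dest="test",  # assumption: store the flag under a positive name
        action="store_false",  # passing --no-test sets args.test to False
        default=True,  # test mode stays on unless the flag is given
        help="Turn off the test flag and generate real PDFs",
    )
    return parser


if __name__ == "__main__":
    url, name = article_url_and_filename(
        "https://startwords.cdh.princeton.edu/issues/3/",
        "/issues/3/mapping-latent-spaces/",
    )
    print(url)
    print(name)
    print(build_parser().parse_args([url, "--no-test"]).test)
```

With `dest="test"`, a call site would read `test=args.test`, which avoids passing an attribute named `no_test` as the `test` value.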
diff --git a/themes/startwords/assets/scss/_fonts.scss b/themes/startwords/assets/scss/_fonts.scss
@@ -195,3 +195,12 @@
     // with the main body text, which as currently font-weight 300
     font-weight: 300;
 }
+
+/* Noto Emoji (black & white version, for print) */
+@font-face {
+    font-family: "Noto Emoji";
+    font-display: swap;
+    src: url('/fonts/NotoEmoji/NotoEmoji-Regular.woff2') format('woff2'),
+         url('/fonts/NotoEmoji/NotoEmoji-Regular.woff') format('woff');
+    unicode-range: U+1F600-1F64F, U+1F900-1F9FF; /* unicode emoji range */
+}
diff --git a/themes/startwords/assets/scss/print.scss b/themes/startwords/assets/scss/print.scss
@@ -13,32 +13,31 @@
   }
 
   @top-right {
-    content: element(page-header);
+    content: flow(page-header);
     opacity: 1;
   }
 
   @bottom-left {
-    content: element(page-footer);
+    content: flow(page-footer);
     opacity: 1;
     font-size: 11px;
     font-weight: 300;
   }
 
-
   @bottom-right {
     content: counter(page);
     font-size: 11px;
     font-weight: 300;
   }
 }
 
-@page :first {
+@page:first {
     /* don't display article title on first page */
     @top-left {
         content: '';
     }
     @top-right {
-        content: element(first-page-header);
+        content: flow(first-page-header);
     }
 }
 
@@ -48,7 +47,9 @@ html, body, main {
 
 
 body {
-    font-family: $font-serif;
+    /* use black & white Noto Emoji as fallback for emoji display in print
+       and in PDFs generated by DocRaptor */
+    font-family: $font-serif, "Noto Emoji";
     font-size: 12px;
     line-height: 18px;
 
@@ -63,7 +64,7 @@ body {
             margin-bottom: 0.5rem;
 
             .number, .theme, .authors, .doi, .tags, time {
-                font-family: $font-sans;
+                font-family: $font-sans, "Noto Emoji";
                 font-size: 14px;
             }
 
@@ -95,7 +96,6 @@ body {
         }
 
         blockquote.pull {
-            /* width: 50%; */
             width: 66%;  /* wider makes it easier to flow content */
             margin-top: 0;
             line-height: 22px;
@@ -166,28 +166,37 @@ main {
         font-weight: 500;
         margin: 16px 0 6px;
     }
-}
 
+    .doi {
+        string-set: doi content();
+    }
+}
 
-.print-only {
-    display: block;
-    /* hide for pinting in browsers that don't support @page styles */
-    opacity: 0;
+/* only display print-only marginal content for docraptor */
+@media (update:none) and (pointer:none) and (hover:none) {
+    .print-only {
+        display: block;
+    }
 }
 
 a.page-header {
-    position: running(page-header);
+    flow: static(page-header);
+    // position: running(page-header);
+    width: 100%;
+    text-align: right;
+    display: block;
 }
 a.first-page-header {
-    position: running(first-page-header);
-}
-
-a.article-title {
-    position: running(header-title);
+    flow: static(first-page-header);
+    width: 100%;
+    text-align: right;
+    display: block;
+    // position: running(first-page-header);
 }
 
 a.page-doi {
-    position: running(page-footer);
+    flow: static(page-footer);
+    // position: running(page-footer);
 }
 
 
@@ -209,6 +218,7 @@ body.article .footnotes {
             font-size: 12px;
             line-height: 18px;
             position: relative;
+            font-family: $font-serif, "Noto Emoji";
 
             /** NOTE: counters are not incrementing when generating PDFs with
               pagedjs (all endnotes are numbered 0). As a workaround,
@@ -247,6 +257,11 @@ body > header, nav, footer, a[rel=alternate] {
     display: none;
 }
 
+// unhide doi used for pdf footers
+.print-only a.page-doi[rel=alternate] {
+    display: inline-block;
+}
+
 
 a.footnote-ref {
     padding: 0;

diff --git a/themes/startwords/layouts/shortcodes/deepzoom.html b/themes/startwords/layouts/shortcodes/deepzoom.html
@@ -27,8 +27,8 @@
     <img src="{{ .Get `pdf-img` }}" alt="{{ .Get `pdf-alt` }}" loading="lazy"/>
     <figcaption>
     {{ with .Get `caption` }}
-    <p>{{ . | markdownify }}</p></figcaption>
+    <p>{{ . | markdownify }}</p>
     {{ end }}
-    <p>The online version of this essay includes an interactive deep zoom viewer displaying a high resolution capture of this object.</p></figcaption>
+    <p><small>The online version of this essay includes an interactive deep zoom viewer displaying a high resolution version of this image.</small></p></figcaption>
 </figure>
 {{ end }}
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Bold.ttf b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Bold.ttf
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Light.ttf b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Light.ttf
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Medium.ttf b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Medium.ttf
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Regular.ttf b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Regular.ttf
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Regular.woff b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Regular.woff
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Regular.woff2 b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-Regular.woff2
diff --git a/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-SemiBold.ttf b/themes/startwords/static/fonts/NotoEmoji/NotoEmoji-SemiBold.ttf