Add resources2pdf to gh pages

LibraryOfCongress · May 31, 2024 · commit 13dd164
1 parent 7db35d2

Showing 8 changed files with 1,005 additions and 1 deletion.
diff --git a/_sources/loc.gov JSON API/resources2pdf.ipynb b/_sources/loc.gov JSON API/resources2pdf.ipynb
@@ -0,0 +1,284 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Downloading an item with multiple pages to PDF\n",
+    "\n",
+    "The loc.gov API provides structured data about Library of Congress collections in JSON and YAML formats. This notebook shows how you can use the API to access the image resources belonging to a Library of Congress item and aggregate them into a single PDF file.\n",
+    "\n",
+    "## Review: Understanding API Responses\n",
+    "\n",
+    "**JSON Response Objects**\n",
+    "Each of the endpoint types has a distinct response format, but they can be broadly grouped into two categories:\n",
+    "- responses to queries for a list of items, or Search Results Responses \n",
+    "- responses to queries for a **single item**, or Item and Resource Responses\n",
+    "\n",
+    "This notebook focuses on the **JSON Response Object** for a **single item** and on compiling its corresponding Resources (the files that make up an item, e.g. page images of a book) into a .pdf file.\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "There are no prerequisites for running this notebook other than installing the libraries listed in the imports section (Pillow, which provides `PIL`, and `requests`).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## I. Imports\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from PIL import Image\n",
+    "import os\n",
+    "from io import BytesIO\n",
+    "import requests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## II. Create a request URL\n",
+    "\n",
+    "First, we will start by ensuring we have a link to an item of interest. In this instance we will look at the [Benjamin Harrison Papers: Series 13, Venezuela Boundary Dispute, 1895-1899; Part 2, 1895-1899](https://www.loc.gov/item/mss250640164/) as an example.\n",
+    "\n",
+    "Notice the format of the link to this item: `https://www.loc.gov/item/mss250640164/`\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Item API Request URL: https://www.loc.gov/item/mss250640164/?fo=json\n"
+     ]
+    }
+   ],
+   "source": [
+    "item_link=\"https://www.loc.gov/item/mss250640164/\"\n",
+    "request_url = item_link + \"?fo=json\"\n",
+    "\n",
+    "# Note: Appending \"?fo=json\" to the item URL asks the API to return the response in JSON format\n",
+    "\n",
+    "print(f'Item API Request URL: {request_url}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will also set a start and end page that we want to download and compile."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "start_page = 1 # first page to include; pages are numbered starting at 1\n",
+    "end_page = 10 # up to and including this page; set this to None to retrieve all pages"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## III. Request Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Top-level data structure:\n",
+      "articles_and_essays, cite_this, item, more_like_this, options, related_items, resources, timestamp, type\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Request the item from the loc.gov API and parse the JSON response\n",
+    "r = requests.get(request_url)\n",
+    "r.raise_for_status()  # stop early on an HTTP error\n",
+    "data = r.json()\n",
+    "\n",
+    "# Here is a quick way to look at the structure of the data\n",
+    "print(\"Top-level data structure:\\n\" + \", \".join(data.keys()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## IV. Resource Data and Extracting Resource URLs\n",
+    "\n",
+    "In the previous code cell, you can see that the response contains a lot of metadata that can be explored. However, in this notebook we will focus on the information about the resources.\n",
+    "\n",
+    "Rather than looking at the item through `data['item']`, we will look at its resources through `data['resources']`. We will then build a list of image URLs, one per resource file, choosing the highest-resolution version of each (the one with the largest height).\n",
+    "\n",
+    "First, let's just retrieve a list of resources/files"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total # of Resources: 1,143\n",
+      "Resource Data: caption, files, image, url\n",
+      "Selected 10 files\n"
+     ]
+    }
+   ],
+   "source": [
+    "resources = data['resources'][0]\n",
+    "files = resources['files']\n",
+    "num_resources = len(files)\n",
+    "print(f'Total # of Resources: {num_resources:,}')\n",
+    "print('Resource Data: ' + \", \".join(key for key in resources.keys()))\n",
+    "\n",
+    "# And select a subset of these files\n",
+    "files = files[(start_page-1):end_page]\n",
+    "print(f'Selected {len(files)} files')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next we will select the highest resolution .jpg image for each file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Found 10 .jpg file URLs\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0019/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0020/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0021/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0022/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0023/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0024/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0025/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0026/full/pct:100/0/default.jpg\n",
+      "https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0027/full/pct:100/0/default.jpg\n"
+     ]
+    }
+   ],
+   "source": [
+    "urls = []\n",
+    "for i, file_sizes in enumerate(files):\n",
+    "    # only select files that are .jpg and have a height\n",
+    "    jpgs = [f for f in file_sizes if 'url' in f and f['url'].endswith('.jpg') and 'height' in f]\n",
+    "\n",
+    "    # Check to see if we have at least one .jpg image\n",
+    "    if len(jpgs) < 1:\n",
+    "        print(f'No .jpgs found in file #{i+1}. Skipping.')\n",
+    "        continue\n",
+    "\n",
+    "    # sort the jpgs by height, descending\n",
+    "    jpgs = sorted(jpgs, key=lambda f: -f['height'])\n",
+    "\n",
+    "    # choose the largest one\n",
+    "    urls.append(jpgs[0]['url'])\n",
+    "\n",
+    "print(f\"Found {len(urls)} .jpg file URLs\")\n",
+    "print(\"\\n\".join(urls))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## V. Downloading images into a PDF file\n",
+    "\n",
+    "Finally, we will download each image into memory, open it with Pillow, and compile the images into a single PDF file. You can change the resolution and file name in the code below.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "LOC Item Resources have been saved as pdf: output/sample.pdf\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Download a single image from its URL and open it with Pillow\n",
+    "def download_image(url):\n",
+    "    response = requests.get(url)\n",
+    "    response.raise_for_status()  # stop early on an HTTP error\n",
+    "    return Image.open(BytesIO(response.content))\n",
+    "\n",
+    "# Download every URL in the list and save the images as the pages of one PDF file\n",
+    "def create_pdf(image_urls, pdf_name):\n",
+    "    images = [download_image(url) for url in image_urls]\n",
+    "\n",
+    "    # create the output folder if it does not exist yet\n",
+    "    os.makedirs(os.path.dirname(pdf_name) or '.', exist_ok=True)\n",
+    "    images[0].save(\n",
+    "        pdf_name, \"PDF\", resolution=100.0, save_all=True, append_images=images[1:]\n",
+    "    )\n",
+    "\n",
+    "    print(\"LOC Item Resources have been saved as pdf: \"+ pdf_name)\n",
+    "\n",
+    "# creating the PDF\n",
+    "pdf_name = 'output/sample.pdf'\n",
+    "create_pdf(urls, pdf_name)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
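
Note on resources2pdf.ipynb: the notebook's page-selection and URL-selection steps can be tried without any network access. The block below is a minimal sketch, not the notebook's own code. The helper names `select_pages` and `tallest_jpg_url` and the `example.org` sample records are hypothetical, and the sketch assumes each entry of `files` is a list of dictionaries with `url` and `height` keys, as the notebook's own filter implies. Treating an `end_page` of `None` or below 1 as "through the last page" is a design choice of this sketch; a plain `files[0:-1]` slice would silently drop the final page.

```python
def select_pages(files, start_page=1, end_page=None):
    """Return pages start_page..end_page (1-based, inclusive).

    end_page=None, or any value below 1 such as -1, means
    "through the last page" (an assumption of this sketch).
    """
    if end_page is not None and end_page < 1:
        end_page = None
    return files[max(start_page, 1) - 1:end_page]


def tallest_jpg_url(variants):
    """Return the URL of the tallest .jpg variant, or None if there is none."""
    jpgs = [v for v in variants
            if v.get('url', '').endswith('.jpg') and 'height' in v]
    if not jpgs:
        return None
    return max(jpgs, key=lambda v: v['height'])['url']


# Hypothetical stand-in for data['resources'][0]['files']:
# one inner list per page, one dictionary per available rendition.
files = [
    [{'url': 'https://example.org/p1/small.jpg', 'height': 100},
     {'url': 'https://example.org/p1/large.jpg', 'height': 800},
     {'url': 'https://example.org/p1/master.tif', 'height': 900}],
    [{'url': 'https://example.org/p2/only.gif', 'height': 50}],
    [{'url': 'https://example.org/p3/page.jpg', 'height': 400}],
]

selected = select_pages(files, start_page=1, end_page=-1)  # all three pages
urls = [tallest_jpg_url(variants) for variants in selected]
urls = [u for u in urls if u is not None]  # skip pages without a usable .jpg
print("\n".join(urls))
```

Using `max` instead of sorting the whole list returns the same rendition as the notebook's descending sort whenever heights are distinct, and it avoids ordering entries that are never used.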
diff --git a/genindex.html b/genindex.html
@@ -162,6 +162,7 @@
         <p aria-level="2" class="caption" role="heading"><span class="caption-text">Loc.gov JSON API notebooks</span></p>
 <ul class="nav bd-sidenav">
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/LOC.gov%20JSON%20API.html">LOC.gov JSON API Guide</a></li>
+<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/resources2pdf.html">Downloading an item with multiple pages to PDF</a></li>
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Accessing%20images%20for%20analysis.html">Accessing images from the loc.gov JSON API for image analysis</a></li>
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Dominant%20colors.html">Color clusters in images in Library of Congress collections</a></li>
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Downloading_Monographs_as_Images_in_Rosenwald_Collection/Downloading%20Monographs%20as%20Images%20in%20Rosenwald%20Collection.html">Connecting the image file to the metadata</a></li>

diff --git a/intro.html b/intro.html
@@ -164,6 +164,7 @@
         <p aria-level="2" class="caption" role="heading"><span class="caption-text">Loc.gov JSON API notebooks</span></p>
 <ul class="nav bd-sidenav">
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/LOC.gov%20JSON%20API.html">LOC.gov JSON API Guide</a></li>
+<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/resources2pdf.html">Downloading an item with multiple pages to PDF</a></li>
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Accessing%20images%20for%20analysis.html">Accessing images from the loc.gov JSON API for image analysis</a></li>
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Dominant%20colors.html">Color clusters in images in Library of Congress collections</a></li>
 <li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Downloading_Monographs_as_Images_in_Rosenwald_Collection/Downloading%20Monographs%20as%20Images%20in%20Rosenwald%20Collection.html">Connecting the image file to the metadata</a></li>

diff --git a/loc.gov JSON API/Chronicling_America_Title_Essay_Datasets/download.html b/loc.gov JSON API/Chronicling_America_Title_Essay_Datasets/download.html
@@ -163,6 +163,7 @@
         <p aria-level="2" class="caption" role="heading"><span class="caption-text">Loc.gov JSON API notebooks</span></p>
 <ul class="nav bd-sidenav">
 <li class="toctree-l1"><a class="reference internal" href="../LOC.gov%20JSON%20API.html">LOC.gov JSON API Guide</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../resources2pdf.html">Downloading an item with multiple pages to PDF</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../Accessing%20images%20for%20analysis.html">Accessing images from the loc.gov JSON API for image analysis</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../Dominant%20colors.html">Color clusters in images in Library of Congress collections</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../Downloading_Monographs_as_Images_in_Rosenwald_Collection/Downloading%20Monographs%20as%20Images%20in%20Rosenwald%20Collection.html">Connecting the image file to the metadata</a></li>