-
Notifications
You must be signed in to change notification settings - Fork 53
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
8 changed files
with
1,005 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,284 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Downloading an item with multiple pages to PDF\n", | ||
"\n", | ||
"The loc.gov API provides structured data about Library of Congress collections in JSON and YAML formats. This notebook shows how you can take use the API to access image resources, belonging to an LoC Item, and aggregate them into a single PDF file. \n", | ||
"\n", | ||
"## Understanding API Responses Review:\n", | ||
"\n", | ||
"**JSON Response Objects**\n", | ||
"Each of the endpoint types has a distinct response format, but they can be broadly grouped into two categories:\n", | ||
"- responses to queries for a list of items, or Search Results Responses \n", | ||
"- responses to queries for a **single item**, or Item and Resource Responses\n", | ||
"\n", | ||
"Furthermore, this notebook will focus on the **JSON Response Object** for a **single item** and formatting its corresponding Resources (files that make-up an item, e.g. pictures of book) into a .pdf file.\n", | ||
"\n", | ||
"## Prerequisites\n", | ||
"\n", | ||
"There are no prequisites in order to run this notebook, besides the installation of libraries listed in the imports section.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## I. Imports\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from PIL import Image\n", | ||
"import os\n", | ||
"from io import BytesIO\n", | ||
"import requests" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## II. Create a request URL\n", | ||
"\n", | ||
"First, we will start by ensuring we have a link to an item of interest. In this instance we will look at the [Benjamin Harrison Papers: Series 13, Venezuela Boundary Dispute, 1895-1899; Part 2, 1895-1899](https://www.loc.gov/item/mss250640164/) as an example.\n", | ||
"\n", | ||
"Notice the format of the link to this item: `https://www.loc.gov/item/mss250640164/`\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Item API Request URL: https://www.loc.gov/item/mss250640164/?fo=json\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"item_link=\"https://www.loc.gov/item/mss250640164/\"\n", | ||
"request_url = item_link + \"?fo=json\"\n", | ||
"\n", | ||
"# Note: The addition of the \"fo=json\" string ensures that the item request is in JSON format\n", | ||
"\n", | ||
"print(f'Item API Request URL: {request_url}')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We will also set a start and end page that we want to download and compile." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"start_page = 1 # starting a 1\n", | ||
"end_page = 10 # up to and including this page, make this -1 to retrieve all pages" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## III. Request Data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Top-level data structure:\n", | ||
"articles_and_essays, cite_this, item, more_like_this, options, related_items, resources, timestamp, type\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Generates request from LOC API to extract data in JSON format\n", | ||
"r = requests.get(request_url)\n", | ||
"data = r.json()\n", | ||
"# print(data)\n", | ||
"\n", | ||
"# Here is a quick way at looking at the structure of the data\n", | ||
"print(\"Top-level data structure:\\n\" + \", \".join(value for value in data.keys()))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## IV. Resource Data and Extracting Resource URLS\n", | ||
"\n", | ||
"In the previous code cell, you can see that the content itself has a lot of Metadata to that can be explored. However, in this notebook we will focus on access to information about the resources.\n", | ||
"\n", | ||
"As opposed to looking at item with `data['item']` we will look at the resources through `data['resources]`. Furthermore, we will be creating a list of all of the resources image urls with the best resolution (based on the largest height).\n", | ||
"\n", | ||
"First, let's just retrieve a list of resources/files" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Total # of Resources: 1,143\n", | ||
"Resource Data: caption, files, image, url\n", | ||
"Selected 10 files\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"resources = data['resources'][0]\n", | ||
"files = resources['files']\n", | ||
"num_resources = len(files)\n", | ||
"print(f'Total # of Resources: {num_resources:,}')\n", | ||
"print('Resource Data: ' + \", \".join(key for key in resources.keys()))\n", | ||
"\n", | ||
"# And select a subset of these files\n", | ||
"files = files[(start_page-1):end_page]\n", | ||
"print(f'Selected {len(files)} files')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Next we will select the highest resolution .jpg image for each file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Found 10 .jpg file URLs\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0019/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0020/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0021/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0022/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0023/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0024/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0025/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0026/full/pct:100/0/default.jpg\n", | ||
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0027/full/pct:100/0/default.jpg\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"urls = []\n", | ||
"for i, file_sizes in enumerate(files):\n", | ||
" # only select files that are .jpg and have a height\n", | ||
" jpgs = [f for f in file_sizes if 'url' in f and f['url'].endswith('.jpg') and 'height' in f]\n", | ||
"\n", | ||
" # Check to see if we have at least one .jpg image\n", | ||
" if len(jpgs) < 1:\n", | ||
" print(f'No .jpgs found in file #{i+1}. Skipping.')\n", | ||
" continue\n", | ||
"\n", | ||
" # sort the jpgs by height, descending\n", | ||
" jpgs = sorted(jpgs, key=lambda f: -f['height'])\n", | ||
"\n", | ||
" # choose the largest one\n", | ||
" urls.append(jpgs[0]['url'])\n", | ||
"\n", | ||
"print(f\"Found {len(urls)} .jpg file URLs\")\n", | ||
"print(\"\\n\".join(urls))\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## V. Downloading images into a PDF file\n", | ||
"\n", | ||
"Finally we will download each image to memory, convert them to images, then compile them into a single PDF file. You can change the resolution and file name in the code below.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 17, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"LOC Item Resources have been saved as pdf: output/sample.pdf\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Function that facilitates the download of the image using it's url\n", | ||
"def download_image(url):\n", | ||
" response = requests.get(url)\n", | ||
" return Image.open(BytesIO(response.content))\n", | ||
"\n", | ||
"# Leveraging the list of urls, this function allows you to create the final .pdf file with the aggregate resources.\n", | ||
"def create_pdf(image_urls, pdf_name):\n", | ||
" images = []\n", | ||
" for url in image_urls:\n", | ||
" image = download_image(url)\n", | ||
" images.append(image)\n", | ||
"\n", | ||
" images[0].save(\n", | ||
" pdf_name, \"PDF\", resolution=100.0, save_all=True, append_images=images[1:]\n", | ||
" )\n", | ||
"\n", | ||
" print(\"LOC Item Resources have been saved as pdf: \"+ pdf_name)\n", | ||
"\n", | ||
"# creating the PDF\n", | ||
"pdf_name = 'output/sample.pdf'\n", | ||
"create_pdf(urls, pdf_name)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.