How To parse dynamically loaded websites
Web-crawling with Scrapy has its limits: If you encounter websites that dynamically load content into the DOM, you might have to expand your software toolkit to actually be able to gather the meta-data that you are looking for.
This holds especially true for websites that make heavy use of modern JavaScript frameworks: what your scrapy.Spider sees in the response.body and - in contrast - what the HTML response looks like inside your browser's developer tools (via the inspect menu) can be wildly different things.
When you're trying to figure out why a Scrapy Selector can't find the element you're looking for - or it actually shows up, but holds no content for you to parse - you might have just encountered a dynamically loaded element. To make sure that Scrapy really doesn't "see" the data you're looking for, confirm your CSS- or XPath-Selector within the Scrapy Shell: if the response is still empty or None, even though it correctly selects the element in your browser's developer tools when you use CTRL+F, it's time to find the data source. Your browser's network tool should now be your first priority: while reloading a page or clicking elements, the network tool might show you specific requests (e.g. a GET request that uses specific query parameters and corresponding values) and reproducing these requests might already yield the data that you were initially looking for.
If this is the case, please take a look at our GitHub Wiki article on REST clients, APIs and how to interact with them.
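As a quick illustration - the endpoint and query parameters below are made up and only stand in for whatever your network tool actually shows you - reproducing such a GET request from Python might look like this:

```python
import requests

# Hypothetical example: the network tool revealed a GET request to an internal
# JSON endpoint; reproducing it directly often returns the data that the
# rendered page was built from.
response = requests.get(
    "https://www.example.org/api/search",       # placeholder endpoint
    params={"query": "kinematik", "page": 1},   # placeholder query parameters
    headers={"Accept": "application/json"},
)
response.raise_for_status()
print(response.json())
```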
If the data you're receiving from specific requests doesn't cover everything that you were looking for (or the requests don't show up at all in your network tool's overview), you might want to give the JavaScript rendering service Splash a go.
Splash can be easily integrated into your Scrapy environment using scrapy-splash. The quickest way to use Splash (if you want to have a local instance for testing and debugging) is by first installing Splash within Docker and starting its container. If you're using a Linux distribution (for the purpose of this explanation we're using Ubuntu), the installation process is fairly short:
1. Make sure that you have successfully installed the Docker Engine by typing sudo docker run hello-world in your Terminal / Console.
2. Install the Splash container: sudo docker pull scrapinghub/splash
3. Start it with: sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
If step (1) fails with a permission denied error message, you might have to add your current user to the docker group (see: Stack Overflow discussion regarding the "permission denied" issue): sudo usermod -aG docker $USER. Afterwards, either run newgrp docker in your Terminal or log out and back in (or reboot), and you should be able to continue with steps (2) and (3).
For more elaborate installation instructions, please consider checking out the Splash Installation Documentation.
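To confirm that the container is actually reachable, you can (for example) query Splash's /_ping endpoint; the host and port below assume the default local setup from step (3):

```python
import requests

# Assumes the local Splash container from step (3) listening on port 8050.
response = requests.get("http://localhost:8050/_ping")
response.raise_for_status()
print(response.json())  # reports an "ok" status if Splash is up and running
```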
Now that you have installed Splash, you can make use of the .env settings located in your converter/ folder (please take a look at our "How-To set up your .env file" wiki article). Inside the .env file you can configure the settings for the Splash instance that you want to use with Scrapy.
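Once those Splash-related settings are wired into Scrapy's configuration, using the rendering service from a spider typically looks like the following sketch (the spider name, URL and wait value are only examples; see the scrapy-splash documentation for details):

```python
import scrapy
from scrapy_splash import SplashRequest


class ExampleSplashSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "example_splash_spider"
    start_urls = ["https://www.example.org"]

    def start_requests(self):
        for url in self.start_urls:
            # Render the page through the local Splash instance and wait 2 seconds
            # so that dynamically loaded elements have time to appear.
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 2})

    def parse(self, response, **kwargs):
        # response.body now contains the Splash-rendered HTML
        self.logger.info(response.css("title::text").get())
```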
Your Splash container allows you to make use of its HTTP API with Splash Scripts, e.g. to get HTML results or take screenshots of websites that use JavaScript. Some websites might only show you specific DOM elements after you've interacted with the website by:
- waiting a specific amount of time
- scrolling on the X- or Y- axis (especially on dynamic websites with "endless scrolling"-implementations)
- clicking a specific element
- resizing your viewport
You can find out if that is the case by opening your local Splash instance's web interface in your browser at 0.0.0.0:8050 (or localhost:8050) as long as your Docker container is running. (Don't forget to start the container again after you've rebooted your PC!)
Splash's render.html endpoint and the splash:html method return the rendered HTML DOM, which can be quickly inspected by using Splash's web interface, where you can also customize the Splash script that is used upon accessing the target URL.
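The same render.html endpoint can also be queried directly from Python; the sketch below assumes the default local Splash container and uses an arbitrary wait time of 2 seconds:

```python
import requests

# Ask the local Splash instance to render the target URL and return the DOM
# after JavaScript has had 2 seconds to run.
rendered = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://kmap.eu/app/browser/Physik/Kinematik", "wait": 2},
)
rendered.raise_for_status()
print(rendered.text[:500])  # first few hundred characters of the rendered HTML
```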
If the website is still not rendered correctly, the Scrapy Documentation recommends using a headless browser. While the Scrapy developers recommend Selenium together with the scrapy-selenium middleware, we tried a different route and used Playwright for its ease of use.
Playwright is an open-source project for end-to-end testing and browser automation maintained by Microsoft under the Apache-2.0 license. Some of its maintainers already worked on Puppeteer, which is maintained by the Chrome DevTools team at Google. Compared to Puppeteer, Playwright allows us to automate a headless browser instance of Chromium, WebKit or Firefox and use these to extract data from websites that the Scrapy framework normally wouldn't be able to render. If our GUI-driven browser can see something on a website, the headless browser instance should also be able to see the same result.
Since our crawlers are built using Python and Scrapy, it would be great if we could control and automate our headless browser while using the same programming language. Thankfully, Python is one of the many languages supported by the Playwright API, via Playwright for Python.
Bundled within our docker_compose.yml comes a container for headless_chrome, which we control with Pyppeteer, a Python port of Puppeteer. The APIs of Puppeteer/Pyppeteer/Playwright are pretty similar, which is why the following instructions are pretty much interchangeable. Skip the "Playwright Installation" part if you want to use the Pyppeteer docker container that comes with our oeh-search-etl installation.
Just make sure that the container is actually running by typing docker-compose up into your Terminal (in the project root directory) at least once. By checking docker stats you should now see a container for Splash and Pyppeteer (displayed as headless_chrome) running. In your .env file you should see the setting for Pyppeteer: PYPPETEER_WS_ENDPOINT = "ws://localhost:3000".
The implementation of Pyppeteer can be found in converter/web_tools.py. The following crawlers make use of Pyppeteer and should give you a rough idea how to grab data with Pyppeteer that can't be seen by Scrapy on its own (a minimal connection sketch follows the list below):
converter/spiders/kmap_spider.py
converter/spiders/zum_mathe_apps_spider.py
converter/spiders/zum_physik_apps_spider.py
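The gist of that approach - this is only a hedged sketch, not the actual web_tools.py implementation - is to connect Pyppeteer to the websocket endpoint configured in your .env file and let the headless browser hand back the rendered page content:

```python
import asyncio

import pyppeteer


async def fetch_rendered_html(url: str) -> str:
    # Connect to the already running headless_chrome container
    # (see PYPPETEER_WS_ENDPOINT in your .env file).
    browser = await pyppeteer.connect(browserWSEndpoint="ws://localhost:3000")
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()  # the DOM after JavaScript has run
    await page.close()
    return html


if __name__ == "__main__":
    print(asyncio.run(fetch_rendered_html("https://kmap.eu/app/browser/Physik/Kinematik")))
```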
Using Playwright alongside Scrapy within our existing project is fairly straightforward if you take a peek at the "Getting Started" guide: to install Playwright within our virtual environment (venv) all we need to do is enter these commands in our terminal / console:
pip install playwright
playwright install
playwright install-deps
If you've cloned our openeduhub/oeh-search-etl repository, you'll see playwright already listed inside the requirements.txt file. If you allow your IDE to install the packages inside the requirements.txt, you can skip the first line and go straight to entering playwright install in your terminal.
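To verify the installation (and to see the basic API before we wire it into a spider), a minimal standalone Playwright script might look like this; the target URL is just an example:

```python
from playwright.sync_api import sync_playwright

# Launch a headless Chromium instance, open a page and print its title.
with sync_playwright() as playwright:
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto("https://kmap.eu/app/browser/Physik/Kinematik")
    print(page.title())
    browser.close()
```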
For this example we're trying to parse the JSON-LD metadata from KMap.eu, a fantastic website for knowledge maps focused on mathematics and physics; specifically, we'll parse the topic "Kinematik" from KMap.
The meta-data-container we're looking for is sitting inside a script-block and looks like this:
<script id="ld" type="application/ld+json"></script>
Using the Scrapy Shell with an XPath-expression for the element's id, we are able to see these results:
>>> response.xpath('//*[@id="ld"]')
[<Selector xpath='//*[@id="ld"]' data='<script id="ld" type="application/ld+...'>]
>>> response.xpath('//*[@id="ld"]').get()
'<script id="ld" type="application/ld+json">{}</script>'
The first query tells us that we're on the right track: the selector seems to be correct, so we're calling the .get() method on it in the hope of extracting the desired data inside the <script> element. What we expect is metadata encoded in JSON-LD syntax, but what we receive are empty {} brackets. Not good.
By looking into our browser's developer tools and inspecting the DOM, we know for sure that there should be the following data inside the JSON:
{
"@context": "https://schema.org",
"@type": "WebPage",
"breadcrumb": {
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"name": "Physik",
"item": "https://kmap.eu/app/browser/Physik"
},
{
"@type": "ListItem",
"position": 2,
"name": "Kinematik",
"item": "https://kmap.eu/app/browser/Physik/Kinematik"
}
]
},
"mainEntity": {
"@type": "Article",
"headline": "Kinematik",
"name": "Kinematik",
"description": "Bewegungslehre",
"keywords": "Physik, Kinematik, Allgemeines, Bezugssysteme, Konstante Geschwindigkeit, Konstante Beschleunigung",
"mainEntityOfPage": "https://kmap.eu/app/browser/Physik/Kinematik",
"image": "https://kmap.eu/app/icons/KMap-Logo-cropped.png",
"datePublished": "2021-07-08T11:31:42.304Z",
"author": {
"@type": "Organization",
"name": "KMap Team"
},
"publisher": {
"@type": "Organization",
"name": "KMap Team",
"email": "[email protected]",
"logo": {
"@type": "ImageObject",
"url": "https://kmap.eu/app/icons/KMap-Logo-cropped.png"
}
},
"license": "https://creativecommons.org/licenses/by-sa/4.0/",
"inLanguage": [
"de"
],
"audience": [
"Lerner/in"
],
"about": [
"Physik"
],
"learningResourceType": [
"Unterrichtsplanung"
]
}
}
We're now using Scrapy in conjunction with Playwright to extract the data that we came for. Scrapy handles the site navigation while Playwright is only called to return the JSON-LD dictionary.
```python
import json
import pprint

import scrapy.http
from playwright.sync_api import sync_playwright


class KMapSpider(scrapy.Spider):
    name = "kmap_spider"
    friendlyName = "KMap.eu"
    version = "0.0.1"
    start_urls = [
        "https://kmap.eu/app/browser/Physik/Kinematik"
    ]
    playwright_instance = None
    browser_instance = None

    def start_requests(self):
        # Start Playwright and launch the headless Chromium instance once,
        # before the first request is scheduled.
        self.playwright_instance = sync_playwright().start()
        self.browser_instance = self.playwright_instance.chromium.launch()
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        # Let Playwright fetch the JSON-LD from the rendered page.
        ld_json = self.grab_json_ld(response.url)
        pp = pprint.PrettyPrinter(indent=2)
        print("The LD_JSON dictionary:", pp.pprint(ld_json))

    def grab_json_ld(self, url_to_crawl, **kwargs) -> dict:
        # Open a fresh BrowserContext and Page ("tab") for this URL.
        context = self.browser_instance.new_context()
        page = context.new_page()
        page.goto(url_to_crawl)
        json_ld_string = page.text_content('//*[@id="ld"]')
        json_ld: dict = json.loads(json_ld_string)
        context.close()
        return json_ld

    def close(self, reason):
        print("CLOSE METHOD: SHUTTING DOWN BROWSER + PLAYWRIGHT")
        self.browser_instance.close()
        self.playwright_instance.stop()
```
Inside the start_requests() method we're opening the headless browser instance that will be kept open as long as our scrapy.Spider is running. For each URL listed in the start_urls list, we're yielding a scrapy.Request that calls the parse() method to crawl the URL with Scrapy. As soon as the grab_json_ld() method is called, Playwright opens a new BrowserContext and a new Playwright Page, which is basically a "tab" inside your headless browser and is used for navigation.
The page.text_content() method uses our previously tested selector to fetch the element's text content as a string, which we then transform with json.loads() into an easy-to-use Python dictionary that we return to our parse() method. Once the grab_json_ld() method is finished, it closes the Playwright context.
The close() method is called when the scrapy.Spider is done with its workload and has finished all scrapy.Requests. By calling close() on our browser instance, we make sure that our headless Chromium instance shuts down properly as soon as it's no longer needed; afterwards we stop the Playwright process as well.
The interesting part of the terminal output while running our spider with scrapy crawl kmap_spider will look like this:
[...]
2021-07-09 11:17:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kmap.eu/app/browser/Physik/Kinematik> (referer: None)
{ '@context': 'https://schema.org',
'@type': 'WebPage',
'breadcrumb': { '@context': 'https://schema.org',
'@type': 'BreadcrumbList',
'itemListElement': [ { '@type': 'ListItem',
'item': 'https://kmap.eu/app/browser/Physik',
'name': 'Physik',
'position': 1},
{ '@type': 'ListItem',
'item': 'https://kmap.eu/app/browser/Physik/Kinematik',
'name': 'Kinematik',
'position': 2}]},
'mainEntity': { '@type': 'Article',
'about': ['Physik'],
'audience': ['Lerner/in'],
'author': {'@type': 'Organization', 'name': 'KMap Team'},
'datePublished': '2021-07-09T09:17:52.891Z',
'description': 'Bewegungslehre',
'headline': 'Kinematik',
'image': 'https://kmap.eu/app/icons/KMap-Logo-cropped.png',
'inLanguage': ['de'],
'keywords': 'Physik, Kinematik, Allgemeines, Bezugssysteme, '
'Konstante Geschwindigkeit, Konstante '
'Beschleunigung',
'learningResourceType': ['Unterrichtsplanung'],
'license': 'https://creativecommons.org/licenses/by-sa/4.0/',
'mainEntityOfPage': 'https://kmap.eu/app/browser/Physik/Kinematik',
'name': 'Kinematik',
'publisher': { '@type': 'Organization',
'email': '[email protected]',
'logo': { '@type': 'ImageObject',
'url': 'https://kmap.eu/app/icons/KMap-Logo-cropped.png'},
'name': 'KMap Team'}}}
2021-07-09 11:17:53 [scrapy.core.engine] INFO: Closing spider (finished)
CLOSE METHOD: SHUTTING DOWN BROWSER + PLAYWRIGHT
2021-07-09 11:17:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
[...]
Now that we are sure that our crawler can extract the data from the JSON-LD <script>, we are prepared for the next steps in the ETL process.
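As a (purely hypothetical) taste of those next steps, the returned dictionary could be read field by field before mapping it onto our items; the keys below are the ones visible in the JSON-LD example above:

```python
def extract_basic_fields(ld_json: dict) -> dict:
    # Hypothetical helper: picks individual fields out of the JSON-LD dictionary
    # (keys taken from the KMap example above) before they are mapped onto item
    # fields in later ETL steps.
    main_entity = ld_json.get("mainEntity", {})
    return {
        "title": main_entity.get("name"),                # "Kinematik"
        "description": main_entity.get("description"),   # "Bewegungslehre"
        "keywords": main_entity.get("keywords", "").split(", "),
        "license": main_entity.get("license"),
        "date_published": main_entity.get("datePublished"),
    }
```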