Added --user-scripts-timeout parameter (#3)
Added --user-scripts-timeout parameter that allows long-running async tasks to succeed.
question44 authored Sep 20, 2023
1 parent fca79ad commit 80c2d13
Showing 3 changed files with 9 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -103,6 +103,7 @@ All other request parameters are optional and have default values. However, you
| `stealth` | Stealth mode allows you to bypass anti-scraping techniques. It is disabled by default. | `false` |
| `screenshot` | If this option is set to true, the result will have the link to the screenshot of the page (`screenshot` field in the response). <b>Important implementation details</b>: Initially, Scrapper attempts to take a screenshot of the entire scrollable page. If it fails because the image is too large, it will only capture the currently visible viewport. | `false` |
| `user-scripts` | To use your JavaScript scripts on the page, add script files to the `user_scripts` directory, and list the required ones (separated by commas) in the `user-scripts` parameter. These scripts will execute after the page loads but before the article parser runs. This allows you to help parse the article in a variety of ways, such as removing markup, ad blocks, or anything else. For example: `user-scripts=remove_ads.js, click_cookie_accept_button.js` | |
| `user-scripts-timeout` | Waits for the given timeout in milliseconds after user scripts are injected. For example, if you want to navigate through the page to specific content, set a longer timeout (higher value). The default value is 0, which means no waiting. | `0` |

#### Playwright settings
| Parameter | Description | Default |
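For illustration only (this is not part of the commit), the new `user-scripts-timeout` option is sent together with the other request parameters described in the table above. A minimal sketch in Python, where the host, port, and endpoint path are assumptions about a local Scrapper deployment rather than values taken from this commit:

```python
# Hypothetical request combining user-scripts and user-scripts-timeout.
# The base URL and endpoint path below are assumptions; adjust them to match
# your own Scrapper deployment and the endpoint documented in the README.
import requests

params = {
    "url": "https://example.com/article",
    "user-scripts": "click_cookie_accept_button.js",  # file in the user_scripts directory
    "user-scripts-timeout": 3000,                      # wait 3 s after the scripts are injected
}
resp = requests.get("http://localhost:3000/api/article", params=params, timeout=60)
resp.raise_for_status()
print(resp.json())  # parsed article; includes a screenshot link if screenshot=true
```

A value of `3000` gives an asynchronous user script three seconds to finish its work (for example, clicking through to the content it needs) before the article parser runs.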
3 changes: 3 additions & 0 deletions scrapper/core/__init__.py
@@ -74,6 +74,9 @@ def page_processing(page, args, init_scripts=None):
for script_name in args.user_scripts:
    page.add_script_tag(path=USER_SCRIPTS / script_name)

# wait for the given timeout in milliseconds after the user scripts were injected
if args.user_scripts_timeout:
    page.wait_for_timeout(args.user_scripts_timeout)

def resource_blocker(whitelist): # list of resource types to allow
    def block(route):
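As a standalone illustration of why the pause matters (this sketch is not part of the commit, and the script file name is only an example): `add_script_tag()` returns as soon as the script element is attached, so any asynchronous work the script starts may still be running when the next processing step begins; `wait_for_timeout()` gives it time to complete.

```python
# Minimal Playwright sketch of the inject-then-wait pattern used above.
# "user_scripts/click_and_wait.js" is a hypothetical script that starts
# asynchronous work (e.g. clicking a button and waiting for content to load).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    # inject the user script; this call does not wait for the script to finish
    page.add_script_tag(path="user_scripts/click_and_wait.js")

    # give the script's async work time to complete before reading the page,
    # mirroring what user-scripts-timeout does in page_processing()
    page.wait_for_timeout(3000)  # milliseconds

    html = page.content()  # now reflects whatever the script changed
    browser.close()
```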
5 changes: 5 additions & 0 deletions scrapper/util/argutil.py
@@ -129,7 +129,12 @@ def f(name, val):
# and list the required ones (separated by commas) in the `user-scripts` parameter. These scripts will execute after the page loads
# but before the article parser runs. This allows you to help parse the article in a variety of ways,
# such as removing markup, ad blocks, or anything else. For example: user-scripts=remove_ads.js, click_cookie_accept_button.js
# If you plan to run asynchronous long-running scripts, check the user-scripts-timeout parameter.
('user-scripts', (is_list,), None),
# Waits for the given timeout in milliseconds after user scripts are injected.
# For example, if you want to navigate through the page to specific content, set a longer timeout (higher value).
# The default value is 0, which means no waiting.
('user-scripts-timeout', (is_number, gte(0)), 0),

# # # Playwright settings:

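The argutil validation helpers themselves are not shown in this diff. As an illustration only (the helper names and behavior below are assumptions, not the project's actual code), an entry such as `('user-scripts-timeout', (is_number, gte(0)), 0)` can be read as a parameter name, a chain of checks applied to the incoming value, and a default used when the parameter is absent:

```python
# Illustrative only: one way a (name, validators, default) entry could be consumed.
# The real is_number/gte helpers in scrapper/util/argutil.py may differ.
def is_number(value):
    return int(value)  # raises ValueError for non-numeric input

def gte(minimum):
    def check(value):
        if value < minimum:
            raise ValueError(f"value must be >= {minimum}")
        return value
    return check

def validate(raw, validators, default):
    if raw is None:
        return default
    value = raw
    for validator in validators:
        value = validator(value)
    return value

print(validate("5000", (is_number, gte(0)), 0))  # -> 5000
print(validate(None, (is_number, gte(0)), 0))    # -> 0
```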
