[Self-Host] call to playwright is failing #902

rostwal95 · 2024-11-15T05:16:26Z

Describe the Issue
When scraping with the playwright engine, the call to the playwright service fails.

Expected Behavior
The call to playwright should succeed, and dynamic JS content should be rendered and cleaned up.

Logs
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Request sent failure status
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via fetch...

Here are the logs.

mogery · 2024-11-15T10:19:24Z

Can you share the logs of the playwright microservice as well?

mkaskov · 2024-11-15T13:41:33Z

Same problem here, using apps/playwright-service-ts:

playwright-service-1 | SyntaxError: Unexpected token " in JSON at position 0
playwright-service-1 | at JSON.parse (<anonymous>)
playwright-service-1 | at createStrictSyntaxError (/usr/src/app/node_modules/body-parser/lib/types/json.js:169:10)
playwright-service-1 | at parse (/usr/src/app/node_modules/body-parser/lib/types/json.js:86:15)
playwright-service-1 | at /usr/src/app/node_modules/body-parser/lib/read.js:128:18
playwright-service-1 | at AsyncResource.runInAsyncScope (node:async_hooks:203:9)
playwright-service-1 | at invokeCallback (/usr/src/app/node_modules/raw-body/index.js:238:16)
playwright-service-1 | at done (/usr/src/app/node_modules/raw-body/index.js:227:7)
playwright-service-1 | at IncomingMessage.onEnd (/usr/src/app/node_modules/raw-body/index.js:287:7)
playwright-service-1 | at IncomingMessage.emit (node:events:517:28)
playwright-service-1 | at endReadableNT (node:internal/streams/readable:1400:12)

mogery · 2024-11-15T14:19:13Z

I just made a change; I think the way we were sending the request to the microservice was wrong. Can you rebuild firecrawl (no need to rebuild playwright-service) and try again?
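
For readers hitting the same stack trace: body-parser's JSON parser in its default strict mode accepts only bodies whose top-level value is an object or array, so a failure on a `"` character at position 0 is consistent with a request body that was JSON-encoded twice. The sketch below reproduces that shape without any dependencies. The `payload` object and the `looksLikeStrictJson` helper are hypothetical names used for illustration; this shows the symptom only and is not the actual Firecrawl change.

```typescript
// Hypothetical payload; the field name is illustrative only.
const payload = { url: "https://example.com" };

// Encoding once yields an object literal: the body starts with "{".
const encodedOnce: string = JSON.stringify(payload);

// Encoding an already-encoded string yields a JSON *string*:
// the body now starts with a double quote.
const encodedTwice: string = JSON.stringify(encodedOnce);

// Rough approximation of a strict-mode check: the first
// non-whitespace character must open an object or an array.
function looksLikeStrictJson(body: string): boolean {
  const first = body.trim().charAt(0);
  return first === "{" || first === "[";
}

console.log(encodedOnce.charAt(0), looksLikeStrictJson(encodedOnce));
console.log(encodedTwice.charAt(0), looksLikeStrictJson(encodedTwice));
```

If the sender passes an object to an HTTP client that serializes it, and also calls `JSON.stringify` itself, the body can end up in the second form.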

rostwal95 · 2024-11-15T16:16:49Z

I am also getting errors while building the Docker container:

=> ERROR [playwright-service 2/6] RUN apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6 0.9s

[playwright-service 2/6] RUN apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6:
0.539 Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
0.645 Err:1 http://deb.debian.org/debian bookworm InRelease
0.645 At least one invalid signature was encountered.
0.648 Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
0.677 Err:2 http://deb.debian.org/debian bookworm-updates InRelease
0.677 At least one invalid signature was encountered.
0.693 Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
0.717 Err:3 http://deb.debian.org/debian-security bookworm-security InRelease
0.717 At least one invalid signature was encountered.
0.722 Reading package lists...
0.728 W: GPG error: http://deb.debian.org/debian bookworm InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian bookworm InRelease' is not signed.
0.728 W: GPG error: http://deb.debian.org/debian bookworm-updates InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian bookworm-updates InRelease' is not signed.
0.728 W: GPG error: http://deb.debian.org/debian-security bookworm-security InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian-security bookworm-security InRelease' is not signed.

failed to solve: process "/bin/sh -c apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6" did not complete successfully: exit code: 100

rostwal95 · 2024-11-15T16:25:57Z

I still see the issue. I am not sure why the playwright failure is logged at info level rather than error:

worker-1 | 2024-11-15 16:23:58 info [:]: 🐂 Worker taking job b2c3e207-55ca-4abb-8be1-57a0b1b88cd2
worker-1 | 2024-11-15 16:23:58 info [ScrapeURL:]: Scraping URL "https://www.britishairways.com/travel/home/public/en_us/"...
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine scrapingbee meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine scrapingbeeLoad meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine playwright meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine fetch meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine pdf meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 info [ScrapeURL:]: Scraping via scrapingbee...
worker-1 | 2024-11-15 16:23:59 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b2c3e207-55ca-4abb-8be1-57a0b1b88cd2","method":"","engine":"scrapingbee","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Engine scrapingbee could not scrape the page.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via scrapingbeeLoad...
worker-1 | 2024-11-15 16:23:59 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b2c3e207-55ca-4abb-8be1-57a0b1b88cd2","method":"","engine":"scrapingbeeLoad","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Engine scrapingbeeLoad could not scrape the page.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 16:23:59 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 16:23:59 debug [ScrapeURL:scrapeURLWithPlaywright]: Request failed
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-15 16:24:01 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveHTMLFromRawHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveHTMLFromRawHTML (7ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveMarkdownFromHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveMarkdownFromHTML (1ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveLinksFromHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveLinksFromHTML (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveMetadataFromRawHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveMetadataFromRawHTML (4ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer uploadScreenshot...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer uploadScreenshot (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer performLLMExtract...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer performLLMExtract (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer coerceFieldsToFormats...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer coerceFieldsToFormats (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer removeBase64Images...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer removeBase64Images (0ms)
worker-1 | 2024-11-15 16:24:01 info [:]: 🐂 Job done b2c3e207-55ca-4abb-8be1-57a0b1b88cd2

The response has empty markdown:

{
  "success": true,
  "data": {
    "markdown": "",
    "metadata": {
      "title": "British Airways | Book Flights, Holidays, City Breaks & Check In Online",
      "description": "Save on worldwide flights and holidays when you book directly with British Airways. Browse our guides, find great deals, manage your booking and check in online.",
      "language": "en",
      "robots": "all",
      "ogLocaleAlternate": [],
      "theme-color": "#ffffff",
      "viewport": "width=device-width, initial-scale=1",
      "sourceURL": "https://www.britishairways.com/travel/home/public/en_us/",
      "url": "https://www.britishairways.com/travel/home/public/en_us/",
      "statusCode": 200
    }
  }
}
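
A side note on the ScrapingBee errors in the log above: the rejected API key is the text "# use if you'd like to use as a fallback scraper", which suggests that a placeholder comment in the env file was read as the value of the key. The sketch below shows one way such a value could be treated as unset. `cleanEnvValue` is a hypothetical helper, not part of Firecrawl, and it only handles values that consist entirely of a comment.

```typescript
// Treat an env value that is empty, or that is only an inline
// comment, as "not set". Hypothetical helper for illustration.
function cleanEnvValue(raw: string | undefined): string | undefined {
  if (raw === undefined) return undefined;
  const value = raw.trim();
  if (value === "" || value.startsWith("#")) return undefined;
  return value;
}

// The value reported in the ScrapingBee error message above.
const fromLog = "# use if you'd like to use as a fallback scraper";

console.log(cleanEnvValue(fromLog));    // undefined
console.log(cleanEnvValue(" abc123 ")); // "abc123"
```

Removing the trailing comment from that line in the env file, so the value is genuinely empty, would likely avoid the problem at the source.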

mkaskov · 2024-11-20T06:53:17Z

Another error. After it happens, Firecrawl stops working correctly:

worker-1 | 2024-11-20 06:40:56 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:56 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:57 info [:]: 🐂 Job done 79431bc8-736d-4379-bf0d-ddae76e0dabe
api-1 | 2024-11-20 06:40:57 warn [:]: You're bypassing authentication {}
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scraping via fetch...
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scraping via fetch...
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done 27e3c51f-1b4a-45a9-8b4f-abe61f67ac8a
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done e2ec67bb-740b-40fa-803d-4918ced6006c
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done a252a1c1-aa4c-4f0c-960b-c379292cb997
worker-1 | 2024-11-20 06:40:59 info [:]: 🐂 Worker taking job 0ad857e1-70f6-4c2d-9255-2c890f207c5a
worker-1 | 2024-11-20 06:40:59 error [:]: 🐂 Job errored 0ad857e1-70f6-4c2d-9255-2c890f207c5a - TypeError: Cannot read properties of undefined (reading 'timeout') {}
worker-1 | 2024-11-20 06:40:59 error [:]: undefined {}
worker-1 | 2024-11-20 06:40:59 error [:]: TypeError: Cannot read properties of undefined (reading 'timeout')
worker-1 | at processJob (/app/dist/src/services/queue-worker.js:249:40)
worker-1 | at processJobInternal (/app/dist/src/services/queue-worker.js:65:30)
worker-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {}
worker-1 | /app/dist/src/main/runWebScraper.js:18
worker-1 | formats: job.data.scrapeOptions.formats.concat(["rawHtml"]),
worker-1 | ^
worker-1 |
worker-1 | TypeError: Cannot read properties of undefined (reading 'formats')
worker-1 | at startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:18:49)
worker-1 | at processJob (/app/dist/src/services/queue-worker.js:245:57)
worker-1 | at processJobInternal (/app/dist/src/services/queue-worker.js:65:30)
worker-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
worker-1 |
worker-1 | Node.js v20.18.0
worker-1 exited with code 1
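
Both TypeErrors in the trace above come from reading a property of `job.data.scrapeOptions` when it is undefined, and the second one is uncaught, which is why the worker exits. A minimal sketch of a defensive read follows. The `ScrapeOptions` and `JobData` shapes and the `formatsWithRawHtml` helper are simplified, hypothetical stand-ins rather than the project's real types, and the guard does not explain why the job arrived without `scrapeOptions` in the first place.

```typescript
// Hypothetical, simplified shapes; the real job payload has more fields.
interface ScrapeOptions {
  formats: string[];
  timeout?: number;
}

interface JobData {
  url: string;
  scrapeOptions?: ScrapeOptions;
}

// Build the list of formats to request, appending "rawHtml",
// without throwing when scrapeOptions is missing.
function formatsWithRawHtml(data: JobData): string[] {
  const formats = data.scrapeOptions?.formats ?? [];
  return formats.concat(["rawHtml"]);
}

console.log(formatsWithRawHtml({ url: "https://example.com" }));
console.log(
  formatsWithRawHtml({
    url: "https://example.com",
    scrapeOptions: { formats: ["markdown"] },
  }),
);
```

The same optional-chaining pattern would apply to the `timeout` read in `processJob`, for example `data.scrapeOptions?.timeout`.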

lauridskern · 2024-11-21T15:28:31Z

same issue for me

fatwang2 · 2024-11-27T08:31:59Z

same issue

rostwal95 added the self-host label Nov 15, 2024

mogery self-assigned this Nov 15, 2024
