Using only one browser for web scraping #1

Open
AndyS1mpson opened this issue Jul 28, 2022 · 1 comment
Comments

AndyS1mpson commented Jul 28, 2022

Hi! Thank you for your repository. I have one question about the implementation. You call browser_type.launch(proxy=proxy) in every task. If I understand correctly, this means a new browser is started in each new Celery task.

I'm trying to optimize the parsing process. To do this, I moved the browser into a separate Docker container started with the npx playwright run-server command, and I connect to it via a WebSocket.

But I have a feeling that I'm doing something wrong, since the load has not decreased and the parsing speed has not increased. Do you know anything about this and can you help?

Thank you in advance

@AndyS1mpson AndyS1mpson changed the title Using one browser for web scraping Using only one browser for web scraping Jul 28, 2022
AnderRV (Member) commented Jul 29, 2022

Hi Andy,
As you say, we launch a browser for each request, which is overkill in a real-world use case.

If you have a browser in a different container, you could leave it running and create only a new page/context for each request, setting the proxy on each of those. This approach incurs some networking overhead, but possibly less than launching a new browser for each request.

This approach needs close monitoring of the container and its memory usage, since browsers can leak memory, and running them for long periods can degrade performance over time.

To improve speed (if this approach does not work), you can always block resources and save time/bandwidth.
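
Resource blocking could look roughly like the sketch below, which uses Playwright's request routing to abort requests by resource type. The set of blocked types and the helper names (`BLOCKED_RESOURCE_TYPES`, `should_block`, `block_heavy_resources`, `enable_blocking`) are assumptions for illustration; which types are safe to drop depends on the target site.

```python
# Resource types to skip; an assumed selection, adjust per target site.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def block_heavy_resources(route) -> None:
    """Route handler: abort blocked types, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def enable_blocking(page) -> None:
    """Install the handler for every request made by the given page."""
    page.route("**/*", block_heavy_resources)
```

Calling `enable_blocking(page)` before `page.goto(url)` applies the filter to that page. Blocking stylesheets can change what some sites render, so that entry may need to be removed if the scraped content depends on layout.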

Regards
