post scrape callback #9

Open
DerManoMann opened this issue Jan 11, 2012 · 2 comments

@DerManoMann

Hi there,

First of all: I really like the project, good stuff!

Now to my request ;)

I use pjscrape to scrape a single file and then do some post-processing of the collected data. Right now, that post-processing doesn't seem to be possible.
It would be great to have a post scrape callback, ideally with the data collected available.

@nrabinowitz
Owner

I was going to say that you could do this in a custom writer, and you can, but it's more of a pain than I initially thought, and you can't really leverage the code in the base writer.

Just to make sure I understand the request here - you want to:

  1. Scrape some data from one or more pages
  2. Post-process the data with a custom function
  3. Use the existing writers and formatters to write your output (e.g. JSON to STDOUT)

Is that right? If you wanted to do some custom writing in (3), I'd say just make a new writer, but if you want to take advantage of the existing writers and formatters you do need an addition to the library.
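For illustration only, the three steps above could be sketched as a wrapper that buffers scraped items, runs a custom function over them, and then hands the result to an existing writer. This is plain JavaScript, not pjscrape code: `withPostProcess`, `makeMemoryWriter`, and the `put`/`finish` writer interface are all hypothetical names assumed for the sketch, and the real writer interface in the library may well look different.

```javascript
// Hypothetical helper, NOT part of pjscrape: wraps a writer-like object
// (assumed to expose put(item) and finish()) with a post-processing step.
function withPostProcess(writer, postProcess) {
    var collected = [];
    return {
        // Step 1: buffer every scraped item instead of writing it right away.
        put: function (item) {
            collected.push(item);
        },
        // Steps 2 and 3: post-process the full data set, then delegate
        // the actual output to the wrapped writer.
        finish: function () {
            var processed = postProcess(collected.slice());
            processed.forEach(function (item) {
                writer.put(item);
            });
            writer.finish();
        }
    };
}

// Stand-in for an "existing writer": keeps items in memory and
// serializes them to JSON when finished (assumption for the demo).
function makeMemoryWriter() {
    var items = [];
    return {
        output: null,
        put: function (item) {
            items.push(item);
        },
        finish: function () {
            this.output = JSON.stringify(items);
        }
    };
}

// Usage: upper-case every scraped string before it is written.
var memory = makeMemoryWriter();
var wrapped = withPostProcess(memory, function (items) {
    return items.map(function (s) {
        return s.toUpperCase();
    });
});
wrapped.put('a');
wrapped.put('b');
wrapped.finish();
// memory.output is now the string '["A","B"]'
```

The point of the wrapper shape is that the post-processing function sees the whole collected data set at once, while formatting and output stay with whatever writer is already in use.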

@DerManoMann
Author

Actually, after creating the issue it occurred to me that I could use a writer...

I would like to do the post processing in the same process, as that processing will need to use phantom too.

What I need to achieve is to:

  • scrape a page
  • replace svg with img elements
  • process the svg into temp. image files
  • save the final HTML
  • write a summary file with images processed, etc.

The point is to convert a complex JS-driven page into a static HTML page.

I think I can live with a custom writer, in particular since I can add custom config options to drive the process.
In that context it is quite nice to be able to provide multiple files on the command line, since that means I can use one file for the actual code and another, created on the fly, that contains just the URL and other config settings.
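
A rough sketch of how the generated file could be produced, assuming it only needs a single `pjs.config({...})` call. The helper `buildConfigSource` and the option names `targetUrl` and `imageDir` are made up for this example; they are custom settings the writer code would have to read itself, not built-in pjscrape options. Writing the string to disk is left out.

```javascript
// Hypothetical helper: builds the source text of a throwaway config file
// that only carries settings. The scraper and writer code would live in a
// separate, hand-written file passed alongside it on the command line.
function buildConfigSource(settings) {
    // JSON is valid JavaScript object-literal syntax, so the serialized
    // settings can be dropped straight into the pjs.config(...) call.
    return 'pjs.config(' + JSON.stringify(settings, null, 4) + ');\n';
}

// Example settings; both keys are assumptions made for illustration.
var source = buildConfigSource({
    targetUrl: 'http://example.com/report',
    imageDir: '/tmp/images'
});
```

Keeping the generated file down to settings means the hand-written file never has to be touched between runs.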
