Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add Scraper/ScraperRun classes #7

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Conversation

tmtmtmtm
Copy link
Contributor

@tmtmtmtm tmtmtmtm commented Feb 6, 2017

First stab at factoring out a Scraper class and a ScraperRun class.

The IndexToMemebers class is probably overkill for now — we can find the
higher level abstractions like that after we nail down the basics — but
it's probably useful to include in this intial WIP which is more for
feedback than expected to actually merge yet.

@tmtmtmtm tmtmtmtm requested a review from chrismytton February 6, 2017 11:43
First stab at factoring out a Scraper class and a ScraperRun class.

The IndexToMemebers class is probably overkill for now — we can find the
higher level abstractions like that after we nail down the basics — but
it's probably useful to include in this intial WIP which is more for
feedback than expected to actually merge yet.
@chrismytton
Copy link
Collaborator

Part of everypolitician/everypolitician#572.

@chrismytton
Copy link
Collaborator

The ScraperRun class is part of everypolitician/everypolitician#574.

@chrismytton chrismytton self-assigned this Feb 8, 2017
This gives us the full stack trace and ensures that the program actually
exits after an error.
This was trying to call `#keys` on an array, which was raising an error.
For now just use the first item from the data array.
Not all records seem to have a start date, so removing this from the
list of default index fields for now.
This doesn't need to be a public method as it's only used internally.
Rather than trying to find the correct generic abstraction this forces
the scrapers to have a class that encapsulates navigating a website to
get the correct content.

In the future we can potentially wrap these up in `IndexToMembers` type
classes, but for now it seems easier to just do scraper specific ones
until we find the correct abstraction.
@tmtmtmtm
Copy link
Contributor Author

tmtmtmtm commented May 6, 2017

Thinking about this a little more, I fear that we got a key underlying concept back to front here. Here we say that a Scraper has a ScraperRun. But I think we're being led astray by two subtly different uses of the world scraper. The Scraper class here wraps an interaction with Scraped, but that's not the same use of a scraper in the morph sense, which might (and often does) require lots of calls to actually fetch, parse, and extract data from multiple pages.

I now think this should all be the other way around: a ScraperRun should be the primary interface, rather than essentially a private class.

@chrismytton chrismytton removed their assignment Nov 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants