Tool for scraping crop data from practicalplants wiki #83
Comments
@petteripitkanen I'll take this one. |
@petteripitkanen I'm about halfway there. I do have one question: I'm going to dump the data in a database, and I'm debating between mongodb and postgres. Any thoughts? One plus for postgres is that I can run SQL queries against it. I also don't see an issue with devising a schema. |
The main output format should be a plain JS file, the one that is currently used. I don't see the benefit of dumping the data to a database, but having multiple output formats seems okay to me. |
@petteripitkanen I have the tool ready for use. Note it only does JSON, but I'm now adding support for other formats. I just wanted you to get a first look; any and all feedback is welcome. I tried it and the file size is 23M (the current one is 17M). I spot-checked a few plants and it has all uses. My repo is here: https://github.com/l0n3star/scraper |
Thanks, I have tried it, it is looking good. These are my comments for the moment; they are mostly about practical problems that I found, I haven't yet gone deeply into the code. I think eventually it would be good to integrate this into the powerplant code base. Also, when the static crop data is updated it is necessary to produce some sort of diff to see easily that there are no regressions (hopefully git diff is clear enough when the crops are sorted to the same order).
- If processPlantContent fails for a crop then this crop is not included in the output. There can easily be connection problems during a run, so a retry mechanism is needed.
- It looks like some crops don't include the property binomial, for example Rosmarinus officinalis <https://practicalplants.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=json&titles=Rosmarinus%20officinalis>, but then title seems to contain the binomial name. There could be logic for this: if binomial is not included, fill it with data from title?
- The property functions is always a string while it should be an array of objects that have the function property.
- It looks like forage should also be an array; currently it is a string, and sometimes it contains }} at the end.
- It would be good to double-check that other known array properties are parsed correctly.
- It would be good to double-check whether there are more properties that are missing data when compared to the raw MediaWiki crop data; maybe there are other properties that we have partly missed before?
- It'd be clearer to have one async function that fetches the crops and returns an array of JS objects; this way the output step could be separated from the fetching step, and it would be easier to add support for different output formats.
- It'd be ok to use a MediaWiki parser library if there is one available that seems suitable, though the parsing logic doesn't seem to be that complex, so I think it is fine to go with fixing the hand-written parser as well.
|
Thanks for the great feedback!
|
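The retry mechanism requested above could be sketched along these lines; `withRetry` and `fetchPlantContent` are illustrative names and not functions from the actual scraper repo:

```javascript
// Hypothetical retry wrapper: re-runs an async function a few times so a
// transient connection error does not silently drop a crop from the output.
async function withRetry(fn, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off briefly before retrying, except after the final attempt.
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage sketch: wrap each per-crop fetch.
// const content = await withRetry(() => fetchPlantContent(title));
```

A crop that still fails after all attempts would then surface as an error instead of being quietly missing, which also helps the diff-based regression check.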
I did some related debugging in #97; it seems that the raw (practicalplants.org) MediaWiki data has two types of properties: strings and arrays of objects, though sometimes the strings are actually CSV-encoded arrays. It looks preferable that the parser understands only these two types (raw strings, arrays of objects), and the conversion from CSV to array is done in another step. Currently all properties that are arrays of objects are incomplete, so for all these properties there is potentially data missing:
- edible part and use
- medicinal part and use
- material part and use
- toxic parts
- functions
- shelter
- forage
- crops
- subspecies
- cultivar groups
- ungrouped cultivars
|
Thank you. I will take a look.
|
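The two-step split described above (the parser yields only raw strings or arrays of objects; CSV conversion happens in a separate pass) could look roughly like this. The function name is a hypothetical illustration, not the actual powerplant code:

```javascript
// Second-pass conversion: turn a known CSV-encoded string property into
// an array, while leaving already-parsed arrays of objects untouched.
function csvToArray(value) {
  // Arrays of objects (and anything non-string) pass through unchanged.
  if (typeof value !== 'string') return value;
  return value
    .split(',')
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}
```

For example, `csvToArray('full sun, partial sun')` yields `['full sun', 'partial sun']`, while an already-parsed array of objects is returned as-is.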
I have taken a look at available JS parsers, for instance infobox-parser and wtf_wikipedia (the latter is Wikipedia-specific, but it is possible to trick the parser by changing the template name). Actually, tackling the general case even for a single template expression seems to require a full wikitext parser; basic wikitext is not the easiest to parse to start with, and it is interleaved with HTML and whatnot, so the task of parsing wikitext into a data structure is quite complex. MediaWiki itself doesn't seem to have such a parser, only code to convert wikitext to HTML for display. Our case is a bit easier, as we only need to handle the practicalplants.org crop data. For this the way to go could be to:
|
I'll get started on adding raw wikitext. |
The extended structure could look like this |
While there doesn't seem to be a JS parser available that generates a complete AST of the nested template structure, it might be useful to use the partial parses and fill in the details, to significantly ease the remaining parsing process. Input:
Output of
Output of
|
Good idea to use partial parses. I will write the tests before completing the parser. This way I'll have clarity. |
I could do the parser; it doesn't seem to take that many lines to write a recursive descent parser that produces an AST, with the limitation of accepting only inputs where the template tokens are used in a restricted way. For now I'd probably accept a PR that extends practicalplants.js to include the raw wikitext (as explained in the previous comment). Your tool could be useful for fetching the raw wikitexts for this PR. |
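A recursive descent parser over nested `{{...}}` template expressions can be sketched roughly as below. The assumptions here are mine, not from the thread: `{{` and `}}` appear only as template delimiters, templates are well nested, and `key=value` parameters are kept as raw strings. This is an illustration of the idea, not the proposed powerplant parser:

```javascript
// Parse one template expression starting at `pos`; returns the AST node
// and the position just past the closing "}}".
function parseTemplate(input, pos = 0) {
  if (input.slice(pos, pos + 2) !== '{{') {
    throw new Error(`expected "{{" at position ${pos}`);
  }
  pos += 2;
  const node = { name: '', params: [] };
  let current = '';
  const flush = () => {
    const text = current.trim();
    current = '';
    if (node.name === '') node.name = text;
    else if (text !== '') node.params.push(text);
  };
  while (pos < input.length) {
    if (input.slice(pos, pos + 2) === '}}') {
      flush();
      return { node, pos: pos + 2 };
    }
    if (input.slice(pos, pos + 2) === '{{') {
      // Recurse into a nested template and attach its AST as a parameter.
      const child = parseTemplate(input, pos);
      node.params.push(child.node);
      pos = child.pos;
    } else if (input[pos] === '|') {
      flush();
      pos += 1;
    } else {
      current += input[pos];
      pos += 1;
    }
  }
  throw new Error('unexpected end of input');
}
```

The whole parser fits in a few dozen lines precisely because the accepted input is restricted; a full wikitext grammar (HTML interleaving, links, comments) would be far larger.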
Sounds fair. I might even pick up the dragon book to learn more about compiler design :) |
I found a parsing library called chevrotain. It lets you define a grammar and generates an AST for you. Mind if I try this out, or do you still prefer to write your own parser? |
I have removed the "good first issue" label from all issues, since I feel that none of them are defined clearly enough to be done by a newcomer who by definition doesn't have an overall view of the project. As we are currently in the process of defining this project more clearly, the conditions are not easy for small contributions. Once the development gets more stable I'll perhaps also have a better view of tasks that would be good for newcomers.
You are welcome to continue exploring different ways to parse the practicalplants MediaWiki format (and powerplant in general), but please note that I also don't currently have a complete picture of what powerplant should look like. If you continue giving me input in the form of comments and PRs, I do try to eventually evaluate them, but the point of view of the evaluation is what would be good for powerplant overall. So it is likely that even if something works it won't get merged if it is against the overall design, because otherwise it would get largely reverted in the next commit.
|
I understand. I think it makes sense for powerplant to be further developed then. Thank you for your extremely valuable feedback on my PRs.
|
Create a tool for scraping crop data from practicalplants wiki. Try to divide it into functions that could also be useful for #80. The output format should be like practicalplants.json, or a symmetrical plain JS file (#70).
Currently practicalplants.json is missing some data for the properties edible part and use, medicinal part and use, and material part and use; the tool should handle these correctly.
The tool should be written in JS, but it can be based on the Python tool that was originally used for generating practicalplants.json.