
Tool for scraping crop data from practicalplants wiki #83

Open
petteripitkanen opened this issue Aug 18, 2019 · 18 comments

@petteripitkanen
Collaborator

Create a tool for scraping crop data from practicalplants wiki. Try to divide it into functions that could also be useful for #80. The output format should be like practicalplants.json, or a symmetrical plain JS file (#70).

Currently practicalplants.json is missing some data for the properties edible part and use, medicinal part and use, and material part and use; the tool should handle these correctly.

The tool should be written in JS but it can be based on the Python tool that was originally used for generating practicalplants.json.
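Since practicalplants.org runs MediaWiki, the tool could fetch raw page content through the standard MediaWiki API (action=query, prop=revisions). A minimal sketch in JS; the API base URL and the helper name buildRevisionUrl are assumptions for illustration, not part of any existing tool:

```javascript
// Assumed MediaWiki API endpoint for practicalplants.org.
const API_BASE = 'https://practicalplants.org/w/api.php';

// Build a URL that asks the standard MediaWiki API for the raw wikitext
// of the latest revision of a page, as JSON.
function buildRevisionUrl(title, apiBase = API_BASE) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    rvslots: 'main',
    titles: title,
    format: 'json',
  });
  return `${apiBase}?${params.toString()}`;
}
```

The URL could then be passed to any HTTP client; the wikitext sits under query.pages[...].revisions in the response.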

@l0n3star
Contributor

@petteripitkanen I'll take this one.

@l0n3star
Contributor

@petteripitkanen I'm about halfway there. I do have one question: I'm going to dump the data in a database, and I'm debating between MongoDB and Postgres. Any thoughts? One point in favor of Postgres is that I can run SQL queries against it. I also don't see an issue with devising a schema.

@petteripitkanen
Collaborator Author

The main output format should be a plain JS file, the one that is currently used. I don't see the benefit of dumping the data into a database, but having multiple output formats seems okay to me.

@l0n3star
Contributor

l0n3star commented Sep 9, 2019

@petteripitkanen I have the tool ready for use. Note that it only outputs JSON, but I'm now adding support for other formats. I just wanted you to get a first look; any and all feedback is welcome. I tried it and the file size is 23M (the current one is 17M). I spot-checked a few plants and it has all the uses. My repo is here: https://github.com/l0n3star/scraper

@petteripitkanen
Collaborator Author

Thanks, I have tried it and it is looking good. These are my comments for the moment; they are mostly about practical problems that I found, as I haven't yet gone deeply into the code. I think it would eventually be good to integrate this into the powerplant code base. Also, when the static crop data is updated, it is necessary to produce some sort of diff to easily verify that there are no regressions (hopefully git diff is clear enough when the crops are sorted into the same order).

  • If processPlantContent fails for a crop, then that crop is not included in the output. There can easily be connection problems during a run, so a retry mechanism is needed.
  • It looks like some crops don't include the property binomial, for example Rosmarinus officinalis, but then title seems to contain the binomial name. There could be logic for this: if binomial is not included, fill it with data from title?
  • The property functions is always a string, while it should be an array of objects that have the function property.
  • It looks like forage should also be an array; currently it is a string, and sometimes it contains }} at the end.
  • It would be good to double-check whether other known array properties are parsed correctly.
  • It would be good to double-check whether there are more properties that are missing data when compared to the raw MediaWiki crop data; maybe there are other properties that we have partly missed before?
  • It'd be clearer to have one async function that fetches the crops and returns an array of JS objects. This way the output step could be separated from the fetching step, and it would be easier to add support for different output formats.
  • It'd be okay to use a MediaWiki parser library if there is one available that seems suitable, though the parsing logic doesn't seem to be that complex, so I think it is fine to go with fixing the hand-written parser as well.
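The first two points could be sketched roughly as follows; withRetry and resolveBinomial are hypothetical names for illustration, not functions in the actual tool:

```javascript
// Retry a flaky async step (e.g. fetching/processing one crop) a few times
// with a fixed delay before giving up, instead of silently dropping the crop.
// The attempt count and delay are illustrative defaults.
async function withRetry(fn, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Fallback for the missing binomial property: use the page title when the
// template itself doesn't carry binomial.
function resolveBinomial(crop) {
  return crop.binomial || crop.title;
}
```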

@l0n3star
Contributor

l0n3star commented Sep 11, 2019 via email

@petteripitkanen
Collaborator Author

I did some related debugging in #97; it seems that the raw (practicalplants.org) MediaWiki data has two types of properties: strings and arrays of objects. Though sometimes the strings are actually CSV-encoded arrays.

It looks preferable that the parser understands only these two types (raw strings, arrays of objects), and the conversion from CSV to array is done in another step.
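That separate conversion step could be as small as this; csvToArray is a hypothetical helper name:

```javascript
// Second pass over the parser output: turn a CSV-encoded raw string
// (e.g. "acid, neutral, alkaline") into a trimmed array of values.
// An empty or missing string becomes an empty array.
function csvToArray(value) {
  if (typeof value !== 'string' || value.trim() === '') return [];
  return value.split(',').map((item) => item.trim());
}
```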

Currently all properties that are arrays of objects are incomplete, so for all these properties there is potentially data missing:

  • edible part and use
  • medicinal part and use
  • material part and use
  • toxic parts
  • functions
  • shelter
  • forage
  • crops
  • subspecies
  • cultivar groups
  • ungrouped cultivars

@l0n3star
Contributor

l0n3star commented Sep 18, 2019 via email

@petteripitkanen
Collaborator Author

I have taken a look at available JS parsers, for instance infobox-parser and wtf_wikipedia (this is Wikipedia-specific but it is possible to trick the parser by changing the template name Plant to Infobox element), and it seems that none of these handle nested templates properly.

Actually, tackling the general case even for a single template expression seems to require a full wikitext parser. Basic wikitext is not the easiest thing to parse to start with, and it is interleaved with HTML and whatnot, so the task of parsing wikitext into a data structure is quite complex. MediaWiki itself doesn't seem to have such a parser, only code to convert wikitext to HTML for display.

Our case is a bit easier, as the practicalplants.org data for the Plant template is quite regular, with only a small amount of HTML and other irregularities. I have preliminary plans for powerplant to be able to use practicalplants.org for automatically populating a local MediaWiki instance, and then to synchronize the local MediaWiki with powerplant, allowing one to edit and browse the crop collection with MediaWiki. For this I'd like to have a special practicalplants wikitext parser in powerplant.

For this the way to go could be to:

  1. Extend the PracticalplantsCrop dataset in db/practicalplants-data.js to also contain the raw wikitext of the Plant template. This could be done in one PR.
  2. Start writing a parser and test it using the raw wikitext data. There are already some tests written in #97 (Create modules for Crop and PracticalplantsCrop types). The parser could start incomplete and be iteratively improved through multiple PRs.
  3. Once the parser passes all tests, update db/practicalplants-data.js with the parsed objects.

@l0n3star
Contributor

I'll get started on adding raw wikitext.

@petteripitkanen
Collaborator Author

The extended structure could look like this [{ wikitext: String, object: PracticalplantsCrop }, ...], and then there could be two functions in db/practicalplants-data.js, getCrops for getting the parsed objects and getWikitextObjectPairs for getting the whole data structure. With this structure it would be easy to compare wikitext and parsed objects in a diff.
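The structure described above could be sketched like this; the inline pair data is a placeholder, not real crop content:

```javascript
// Sketch of db/practicalplants-data.js holding [{ wikitext, object }] pairs.
// The single entry here is illustrative placeholder data.
const pairs = [
  {
    wikitext: '{{Plant|common=Rosemary}}',
    object: { binomial: 'Rosmarinus officinalis', common: 'Rosemary' },
  },
];

// Whole data structure, useful for diffing wikitext against parsed objects.
function getWikitextObjectPairs() {
  return pairs;
}

// Just the parsed PracticalplantsCrop objects, for normal consumers.
function getCrops() {
  return pairs.map((pair) => pair.object);
}
```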

@petteripitkanen
Collaborator Author

While there doesn't seem to be a JS parser available that generates a complete AST of the nested template structure, it might be useful to take the partial parses and fill in the details, to significantly ease the remaining parsing process.

Input:

{{Plant
|common=Rosemary
|family=Lamiaceae
|primary image=Rosmarinus officinalis.jpg
|forage={{Plant provides forage for|forage=Bees}}
|edible part and use={{Has part with edible use
|part used=Leaves
|part used for=Herbs
}}{{Has part with edible use
|part used=Leaves
|part used for=Dried
}}{{Has part with edible use
|part used=Flowers
|part used for=Salads
}}
|material part and use=
|medicinal part and use=
|sun=full sun
|shade=no shade
|hardiness zone=7
|heat zone=
|water=low
|drought=tolerant
|soil water retention=well drained, moist
|soil texture=sandy, loamy
|soil ph=acid, neutral, alkaline
|wind=No
|maritime=Yes
|native range=South Europe, West Asia
|ecosystem niche=Shrub
|life cycle=perennial
|herbaceous or woody=woody
|deciduous or evergreen=evergreen
|
|fertility=self fertile
|mature measurement unit=metres
|mature height=1.2
|mature width=1.2
|flower colour=blue
|grow from=seed, cutting
|seed requires stratification=No
|seed dormancy depth=
|seed requires scarification=No
|seed requires smokification=No
|cutting type=semi-ripe
|bulb type=
|graft rootstock=
|edible parts=flowers, leaves
|edible uses=Herb, Salad, Dry
}}

Output of infobox-parser(input):

{ general:
   { common: 'Rosemary',
     family: 'Lamiaceae',
     primaryImage: 'Rosmarinus officinalis.jpg',
     forage: 'Plant provides forage for',
     ediblePartAndUse: 'Has part with edible use',
     partUsed: 'Flowers',
     partUsedFor: 'Salads',
     sun: 'full sun',
     shade: 'no shade',
     hardinessZone: '7',
     water: 'low',
     drought: 'tolerant',
     soilWaterRetention: 'well drained, moist',
     soilTexture: 'sandy, loamy',
     soilPh: 'acid, neutral, alkaline',
     wind: 'No',
     maritime: 'Yes',
     nativeRange: 'South Europe, West Asia',
     ecosystemNiche: 'Shrub',
     lifeCycle: 'perennial',
     herbaceousOrWoody: 'woody',
     deciduousOrEvergreen: 'evergreen',
     fertility: 'self fertile',
     matureMeasurementUnit: 'metres',
     matureHeight: '1.2',
     matureWidth: '1.2',
     flowerColour: 'blue',
     growFrom: 'seed, cutting',
     seedRequiresStratification: 'No',
     seedRequiresScarification: 'No',
     seedRequiresSmokification: 'No',
     cuttingType: 'semi-ripe',
     edibleParts: 'flowers, leaves',
     edibleUses: 'Herb, Salad, Dry' },
  tables: [],
  lists: [] }

Output of wtf_wikipedia(input).templates():

[ { forage: 'Bees', template: 'plant provides forage for' },
  { 'part used': 'Leaves',
    'part used for': 'Herbs',
    template: 'has part with edible use' },
  { 'part used': 'Leaves',
    'part used for': 'Dried',
    template: 'has part with edible use' },
  { 'part used': 'Flowers',
    'part used for': 'Salads',
    template: 'has part with edible use' },
  { common: 'Rosemary',
    family: 'Lamiaceae',
    'primary image': 'Rosmarinus officinalis.jpg',
    sun: 'full sun',
    shade: 'no shade',
    'hardiness zone': '7',
    water: 'low',
    drought: 'tolerant',
    'soil water retention': 'well drained, moist',
    'soil texture': 'sandy, loamy',
    'soil ph': 'acid, neutral, alkaline',
    wind: 'No',
    maritime: 'Yes',
    'native range': 'South Europe, West Asia',
    'ecosystem niche': 'Shrub',
    'life cycle': 'perennial',
    'herbaceous or woody': 'woody',
    'deciduous or evergreen': 'evergreen',
    list: [ '' ],
    fertility: 'self fertile',
    'mature measurement unit': 'metres',
    'mature height': '1.2',
    'mature width': '1.2',
    'flower colour': 'blue',
    'grow from': 'seed, cutting',
    'seed requires stratification': 'No',
    'seed requires scarification': 'No',
    'seed requires smokification': 'No',
    'cutting type': 'semi-ripe',
    'edible parts': 'flowers, leaves',
    'edible uses': 'Herb, Salad, Dry',
    template: 'plant' } ]
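The wtf_wikipedia output loses the nesting but keeps the data, so the flat template list could be regrouped under the parent properties by template name. A sketch; the template-to-property mapping and the function name regroupTemplates are assumptions based on the example output above:

```javascript
// Maps a nested template name (as reported by wtf_wikipedia) back to the
// Plant property it was nested under. Only two mappings shown; the real
// list would cover all array properties.
const TEMPLATE_TO_PROPERTY = {
  'has part with edible use': 'edible part and use',
  'plant provides forage for': 'forage',
};

// Rebuild one crop object from the flat template list: start from the
// 'plant' template's own fields, then attach each nested template as an
// element of the matching array property.
function regroupTemplates(templates) {
  const plant = templates.find((t) => t.template === 'plant') || {};
  const { template: _ignored, ...crop } = plant;
  for (const t of templates) {
    const property = TEMPLATE_TO_PROPERTY[t.template];
    if (!property) continue;
    const { template, ...fields } = t;
    crop[property] = (crop[property] || []).concat([fields]);
  }
  return crop;
}
```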

@l0n3star
Contributor

l0n3star commented Sep 30, 2019

Good idea to use partial parses. I will write the tests before completing the parser. This way I'll have clarity.

@petteripitkanen
Collaborator Author

I could do the parser; it doesn't seem to take that many lines to write a recursive descent parser that produces an AST, with the limitation of accepting only inputs where the tokens {{ and }} appear solely as part of template expressions (and not within HTML constructs).
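A minimal sketch of such a recursive descent parser, under the stated limitation. The AST shape (text and template nodes, params with key/value) is my own choice for illustration, not the shape powerplant would necessarily use:

```javascript
// Parse wikitext into a list of nodes, recursing into nested {{...}}
// templates. Only {{, }}, | and = are treated as structure; everything
// else is plain text. Values containing a literal '=' would need more care.
function parseWikitext(input) {
  let pos = 0;

  // Parse a sequence of text/template nodes until one of the stopAt
  // tokens (or end of input) is reached.
  function parseNodes(stopAt) {
    const nodes = [];
    let text = '';
    while (pos < input.length) {
      if (input.startsWith('{{', pos)) {
        if (text) { nodes.push({ type: 'text', value: text }); text = ''; }
        nodes.push(parseTemplate());
      } else if (stopAt && stopAt.some((s) => input.startsWith(s, pos))) {
        break;
      } else {
        text += input[pos];
        pos += 1;
      }
    }
    if (text) nodes.push({ type: 'text', value: text });
    return nodes;
  }

  // Parse one {{Name|key=value|...}} expression; values may themselves
  // contain nested templates.
  function parseTemplate() {
    pos += 2; // consume '{{'
    const nameNodes = parseNodes(['|', '}}']);
    const name = nameNodes.map((n) => n.value || '').join('').trim();
    const params = [];
    while (input.startsWith('|', pos)) {
      pos += 1; // consume '|'
      const valueNodes = parseNodes(['|', '}}', '=']);
      if (input.startsWith('=', pos)) {
        pos += 1; // consume '='
        const key = valueNodes.map((n) => n.value || '').join('').trim();
        params.push({ key, value: parseNodes(['|', '}}']) });
      } else {
        params.push({ key: null, value: valueNodes });
      }
    }
    pos += 2; // consume '}}'
    return { type: 'template', name, params };
  }

  return parseNodes();
}
```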

For now I'd probably accept a PR that extends practicalplants.js to include the raw wikitext (as explained in the previous comment). Your tool could be useful for fetching the raw wikitexts for this PR.

@l0n3star
Contributor

l0n3star commented Oct 2, 2019

Sounds fair. I might even pick up the dragon book to understand more on compiler design :)

@l0n3star
Contributor

l0n3star commented Oct 7, 2019

I found a parsing library called chevrotain. It lets you define a grammar and generates an AST for you. Mind if I try this out or do you still prefer to write your own parser?

@petteripitkanen
Collaborator Author

I have removed the "good first issue" label from all issues, since I feel that none of them are defined clearly enough to be done by a newcomer, who by definition doesn't have an overall view of the project. As we are currently in the process of defining this project more clearly, the conditions are not easy for small contributions. Once development gets more stable, I'll perhaps also have a better view of which tasks would be good for newcomers.

You are welcome to continue exploring different ways to parse the practicalplants MediaWiki format (and powerplant in general), but please note that I also don't currently have a complete picture of how powerplant should look. If you continue giving me input in the form of comments and PRs, I will try to eventually evaluate them, but the point of view of the evaluation is what would be good for powerplant overall. It is likely that even if something works, it won't get merged if it goes against the overall design, because otherwise it would get largely reverted in the next commit.

@l0n3star
Contributor

l0n3star commented Oct 9, 2019 via email
