
Tool for scraping crop data from practicalplants wiki #83

Open
petteripitkanen opened this issue Aug 18, 2019 · 18 comments

@petteripitkanen
Collaborator

Create a tool for scraping crop data from practicalplants wiki. Try to divide it into functions that could also be useful for #80. The output format should be like practicalplants.json, or a symmetrical plain JS file (#70).

Currently practicalplants.json is missing some data for the properties edible part and use, medicinal part and use, and material part and use; the tool should handle these correctly.

The tool should be written in JS but it can be based on the Python tool that was originally used for generating practicalplants.json.
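Since practicalplants.org runs MediaWiki, the tool could fetch raw page content through the standard MediaWiki API (action=query, prop=revisions). A minimal sketch in JS; the API base URL and the helper name buildRevisionUrl are assumptions for illustration, not part of any existing tool:

```javascript
// Assumed MediaWiki API endpoint for practicalplants.org.
const API_BASE = 'https://practicalplants.org/w/api.php';

// Build a URL that asks the standard MediaWiki API for the raw wikitext
// of the latest revision of a page, as JSON.
function buildRevisionUrl(title, apiBase = API_BASE) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    rvslots: 'main',
    titles: title,
    format: 'json',
  });
  return `${apiBase}?${params.toString()}`;
}
```

The URL could then be passed to any HTTP client; the wikitext sits under query.pages[...].revisions in the response.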

@l0n3star
Contributor

@petteripitkanen I'll take this one.

@l0n3star
Contributor

@petteripitkanen I'm about halfway there. I do have one question: I'm going to dump the data in a database, and I'm debating between MongoDB and Postgres. Any thoughts? One point in favor of Postgres is that I can run SQL queries against it. I also don't see an issue with devising a schema.

@petteripitkanen
Collaborator Author

The main output format should be a plain JS file, the one that is currently used. I don't see the benefit of dumping the data into a database, but having multiple output formats seems okay to me.

@l0n3star
Contributor

l0n3star commented Sep 9, 2019

@petteripitkanen I have the tool ready for use. Note that it only outputs JSON, but I'm now adding support for other formats. I just wanted you to get a first look; any and all feedback is welcome. I tried it and the file size is 23M (the current one is 17M). I spot-checked a few plants and it has all the uses. My repo is here: https://github.com/l0n3star/scraper

@petteripitkanen
Collaborator Author

Thanks, I have tried it and it is looking good. These are my comments for the moment; they are mostly about practical problems that I found, as I haven't yet gone deeply into the code. I think it would eventually be good to integrate this into the powerplant code base. Also, when the static crop data is updated, it is necessary to produce some sort of diff to easily verify that there are no regressions (hopefully git diff is clear enough when the crops are sorted into the same order).

  • If processPlantContent fails for a crop, then that crop is not included in the output. There can easily be connection problems during a run, so a retry mechanism is needed.
  • It looks like some crops don't include the property binomial, for example Rosmarinus officinalis, but then title seems to contain the binomial name. There could be logic for this: if binomial is not included, fill it with data from title?
  • The property functions is always a string, while it should be an array of objects that have the function property.
  • It looks like forage should also be an array; currently it is a string, and sometimes it contains }} at the end.
  • It would be good to double-check whether other known array properties are parsed correctly.
  • It would be good to double-check whether there are more properties that are missing data when compared to the raw MediaWiki crop data; maybe there are other properties that we have partly missed before?
  • It'd be clearer to have one async function that fetches the crops and returns an array of JS objects. This way the output step could be separated from the fetching step, and it would be easier to add support for different output formats.
  • It'd be okay to use a MediaWiki parser library if there is one available that seems suitable, though the parsing logic doesn't seem to be that complex, so I think it is fine to go with fixing the hand-written parser as well.
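The first two points could be sketched roughly as follows; withRetry and resolveBinomial are hypothetical names for illustration, not functions in the actual tool:

```javascript
// Retry a flaky async step (e.g. fetching/processing one crop) a few times
// with a fixed delay before giving up, instead of silently dropping the crop.
// The attempt count and delay are illustrative defaults.
async function withRetry(fn, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Fallback for the missing binomial property: use the page title when the
// template itself doesn't carry binomial.
function resolveBinomial(crop) {
  return crop.binomial || crop.title;
}
```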

@l0n3star
Contributor

l0n3star commented Sep 11, 2019 via email

@petteripitkanen
Collaborator Author

I did some related debugging in #97; it seems that the raw (practicalplants.org) MediaWiki data has two types of properties: strings and arrays of objects. Though sometimes the strings are actually CSV-encoded arrays.

It looks preferable that the parser understands only these two types (raw strings, arrays of objects), and the conversion from CSV to array is done in another step.
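That separate conversion step could be as small as this; csvToArray is a hypothetical helper name:

```javascript
// Second pass over the parser output: turn a CSV-encoded raw string
// (e.g. "acid, neutral, alkaline") into a trimmed array of values.
// An empty or missing string becomes an empty array.
function csvToArray(value) {
  if (typeof value !== 'string' || value.trim() === '') return [];
  return value.split(',').map((item) => item.trim());
}
```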

Currently all properties that are arrays of objects are incomplete, so for all these properties there is potentially data missing:

  • edible part and use
  • medicinal part and use
  • material part and use
  • toxic parts
  • functions
  • shelter
  • forage
  • crops
  • subspecies
  • cultivar groups
  • ungrouped cultivars

@l0n3star
Contributor

l0n3star commented Sep 18, 2019 via email

@petteripitkanen
Collaborator Author

I have taken a look at available JS parsers, for instance infobox-parser and wtf_wikipedia (this is Wikipedia-specific but it is possible to trick the parser by changing the template name Plant to Infobox element), and it seems that none of these handle nested templates properly.

Actually, tackling the general case even for a single template expression seems to require a full wikitext parser. Basic wikitext is not the easiest thing to parse to start with, and it is interleaved with HTML and whatnot, so the task of parsing wikitext into a data structure is quite complex. MediaWiki itself doesn't seem to have such a parser, only code to convert wikitext to HTML for display.

Our case is a bit easier, as the practicalplants.org data for the Plant template is quite regular, with only a small amount of HTML and other irregularities. I have preliminary plans for powerplant to be able to use practicalplants.org for automatically populating a local MediaWiki instance, and then to synchronize the local MediaWiki with powerplant, allowing one to edit and browse the crop collection with MediaWiki. For this I'd like to have a special practicalplants wikitext parser in powerplant.

For this the way to go could be to:

  1. Extend the PracticalplantsCrop dataset in db/practicalplants-data.js to also contain the raw wikitext of the Plant template. This could be done in one PR.
  2. Start writing a parser and test it using the raw wikitext data. There are already some tests written in #97 (Create modules for Crop and PracticalplantsCrop types). The parser could start incomplete and be iteratively improved through multiple PRs.
  3. Once the parser passes all tests, update db/practicalplants-data.js with the parsed objects.

@l0n3star
Contributor

I'll get started on adding raw wikitext.

@petteripitkanen
Collaborator Author

The extended structure could look like this [{ wikitext: String, object: PracticalplantsCrop }, ...], and then there could be two functions in db/practicalplants-data.js, getCrops for getting the parsed objects and getWikitextObjectPairs for getting the whole data structure. With this structure it would be easy to compare wikitext and parsed objects in a diff.
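The structure described above could be sketched like this; the inline pair data is a placeholder, not real crop content:

```javascript
// Sketch of db/practicalplants-data.js holding [{ wikitext, object }] pairs.
// The single entry here is illustrative placeholder data.
const pairs = [
  {
    wikitext: '{{Plant|common=Rosemary}}',
    object: { binomial: 'Rosmarinus officinalis', common: 'Rosemary' },
  },
];

// Whole data structure, useful for diffing wikitext against parsed objects.
function getWikitextObjectPairs() {
  return pairs;
}

// Just the parsed PracticalplantsCrop objects, for normal consumers.
function getCrops() {
  return pairs.map((pair) => pair.object);
}
```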

@petteripitkanen
Collaborator Author

While there doesn't seem to be a JS parser available that generates a complete AST of the nested template structure, it might be useful to take the partial parses and fill in the details, to significantly ease the remaining parsing process.

Input:

{{Plant
|common=Rosemary
|family=Lamiaceae
|primary image=Rosmarinus officinalis.jpg
|forage={{Plant provides forage for|forage=Bees}}
|edible part and use={{Has part with edible use
|part used=Leaves
|part used for=Herbs
}}{{Has part with edible use
|part used=Leaves
|part used for=Dried
}}{{Has part with edible use
|part used=Flowers
|part used for=Salads
}}
|material part and use=
|medicinal part and use=
|sun=full sun
|shade=no shade
|hardiness zone=7
|heat zone=
|water=low
|drought=tolerant
|soil water retention=well drained, moist
|soil texture=sandy, loamy
|soil ph=acid, neutral, alkaline
|wind=No
|maritime=Yes
|native range=South Europe, West Asia
|ecosystem niche=Shrub
|life cycle=perennial
|herbaceous or woody=woody
|deciduous or evergreen=evergreen
|
|fertility=self fertile
|mature measurement unit=metres
|mature height=1.2
|mature width=1.2
|flower colour=blue
|grow from=seed, cutting
|seed requires stratification=No
|seed dormancy depth=
|seed requires scarification=No
|seed requires smokification=No
|cutting type=semi-ripe
|bulb type=
|graft rootstock=
|edible parts=flowers, leaves
|edible uses=Herb, Salad, Dry
}}

Output of infobox-parser(input):

{ general:
   { common: 'Rosemary',
     family: 'Lamiaceae',
     primaryImage: 'Rosmarinus officinalis.jpg',
     forage: 'Plant provides forage for',
     ediblePartAndUse: 'Has part with edible use',
     partUsed: 'Flowers',
     partUsedFor: 'Salads',
     sun: 'full sun',
     shade: 'no shade',
     hardinessZone: '7',
     water: 'low',
     drought: 'tolerant',
     soilWaterRetention: 'well drained, moist',
     soilTexture: 'sandy, loamy',
     soilPh: 'acid, neutral, alkaline',
     wind: 'No',
     maritime: 'Yes',
     nativeRange: 'South Europe, West Asia',
     ecosystemNiche: 'Shrub',
     lifeCycle: 'perennial',
     herbaceousOrWoody: 'woody',
     deciduousOrEvergreen: 'evergreen',
     fertility: 'self fertile',
     matureMeasurementUnit: 'metres',
     matureHeight: '1.2',
     matureWidth: '1.2',
     flowerColour: 'blue',
     growFrom: 'seed, cutting',
     seedRequiresStratification: 'No',
     seedRequiresScarification: 'No',
     seedRequiresSmokification: 'No',
     cuttingType: 'semi-ripe',
     edibleParts: 'flowers, leaves',
     edibleUses: 'Herb, Salad, Dry' },
  tables: [],
  lists: [] }

Output of wtf_wikipedia(input).templates():

[ { forage: 'Bees', template: 'plant provides forage for' },
  { 'part used': 'Leaves',
    'part used for': 'Herbs',
    template: 'has part with edible use' },
  { 'part used': 'Leaves',
    'part used for': 'Dried',
    template: 'has part with edible use' },
  { 'part used': 'Flowers',
    'part used for': 'Salads',
    template: 'has part with edible use' },
  { common: 'Rosemary',
    family: 'Lamiaceae',
    'primary image': 'Rosmarinus officinalis.jpg',
    sun: 'full sun',
    shade: 'no shade',
    'hardiness zone': '7',
    water: 'low',
    drought: 'tolerant',
    'soil water retention': 'well drained, moist',
    'soil texture': 'sandy, loamy',
    'soil ph': 'acid, neutral, alkaline',
    wind: 'No',
    maritime: 'Yes',
    'native range': 'South Europe, West Asia',
    'ecosystem niche': 'Shrub',
    'life cycle': 'perennial',
    'herbaceous or woody': 'woody',
    'deciduous or evergreen': 'evergreen',
    list: [ '' ],
    fertility: 'self fertile',
    'mature measurement unit': 'metres',
    'mature height': '1.2',
    'mature width': '1.2',
    'flower colour': 'blue',
    'grow from': 'seed, cutting',
    'seed requires stratification': 'No',
    'seed requires scarification': 'No',
    'seed requires smokification': 'No',
    'cutting type': 'semi-ripe',
    'edible parts': 'flowers, leaves',
    'edible uses': 'Herb, Salad, Dry',
    template: 'plant' } ]
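The wtf_wikipedia output loses the nesting but keeps the data, so the flat template list could be regrouped under the parent properties by template name. A sketch; the template-to-property mapping and the function name regroupTemplates are assumptions based on the example output above:

```javascript
// Maps a nested template name (as reported by wtf_wikipedia) back to the
// Plant property it was nested under. Only two mappings shown; the real
// list would cover all array properties.
const TEMPLATE_TO_PROPERTY = {
  'has part with edible use': 'edible part and use',
  'plant provides forage for': 'forage',
};

// Rebuild one crop object from the flat template list: start from the
// 'plant' template's own fields, then attach each nested template as an
// element of the matching array property.
function regroupTemplates(templates) {
  const plant = templates.find((t) => t.template === 'plant') || {};
  const { template: _ignored, ...crop } = plant;
  for (const t of templates) {
    const property = TEMPLATE_TO_PROPERTY[t.template];
    if (!property) continue;
    const { template, ...fields } = t;
    crop[property] = (crop[property] || []).concat([fields]);
  }
  return crop;
}
```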

@l0n3star
Contributor

l0n3star commented Sep 30, 2019

Good idea to use partial parses. I will write the tests before completing the parser. This way I'll have clarity.

@petteripitkanen
Collaborator Author

I could do the parser; it doesn't seem to take that many lines to write a recursive descent parser that produces an AST, with the limitation of accepting only inputs where the tokens {{ and }} appear solely as part of template expressions (and not within HTML constructs).
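A minimal sketch of such a recursive descent parser, under the stated limitation. The AST shape (text and template nodes, params with key/value) is my own choice for illustration, not the shape powerplant would necessarily use:

```javascript
// Parse wikitext into a list of nodes, recursing into nested {{...}}
// templates. Only {{, }}, | and = are treated as structure; everything
// else is plain text. Values containing a literal '=' would need more care.
function parseWikitext(input) {
  let pos = 0;

  // Parse a sequence of text/template nodes until one of the stopAt
  // tokens (or end of input) is reached.
  function parseNodes(stopAt) {
    const nodes = [];
    let text = '';
    while (pos < input.length) {
      if (input.startsWith('{{', pos)) {
        if (text) { nodes.push({ type: 'text', value: text }); text = ''; }
        nodes.push(parseTemplate());
      } else if (stopAt && stopAt.some((s) => input.startsWith(s, pos))) {
        break;
      } else {
        text += input[pos];
        pos += 1;
      }
    }
    if (text) nodes.push({ type: 'text', value: text });
    return nodes;
  }

  // Parse one {{Name|key=value|...}} expression; values may themselves
  // contain nested templates.
  function parseTemplate() {
    pos += 2; // consume '{{'
    const nameNodes = parseNodes(['|', '}}']);
    const name = nameNodes.map((n) => n.value || '').join('').trim();
    const params = [];
    while (input.startsWith('|', pos)) {
      pos += 1; // consume '|'
      const valueNodes = parseNodes(['|', '}}', '=']);
      if (input.startsWith('=', pos)) {
        pos += 1; // consume '='
        const key = valueNodes.map((n) => n.value || '').join('').trim();
        params.push({ key, value: parseNodes(['|', '}}']) });
      } else {
        params.push({ key: null, value: valueNodes });
      }
    }
    pos += 2; // consume '}}'
    return { type: 'template', name, params };
  }

  return parseNodes();
}
```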

For now I'd probably accept a PR that extends practicalplants.js to include the raw wikitext (as explained in the previous comment). Your tool could be useful for fetching the raw wikitexts for this PR.

@l0n3star
Contributor

l0n3star commented Oct 2, 2019

Sounds fair. I might even pick up the dragon book to understand more on compiler design :)

@l0n3star
Contributor

l0n3star commented Oct 7, 2019

I found a parsing library called chevrotain. It lets you define a grammar and generates an AST for you. Mind if I try this out or do you still prefer to write your own parser?

@petteripitkanen
Collaborator Author

I have removed the "good first issue" label from all issues, since I feel that none of them are defined clearly enough to be done by a newcomer, who by definition doesn't have an overall view of the project. As we are currently in the process of defining this project more clearly, the conditions are not easy for small contributions. Once development gets more stable, I'll perhaps also have a better view of which tasks would be good for newcomers.

You are welcome to continue exploring different ways to parse the practicalplants MediaWiki format (and powerplant in general), but please note that I also don't currently have a complete picture of how powerplant should look. If you continue giving me input in the form of comments and PRs, I will try to eventually evaluate them, but the point of view of the evaluation is what would be good for powerplant overall. It is likely that even if something works, it won't get merged if it goes against the overall design, because otherwise it would get largely reverted in the next commit.

@l0n3star
Contributor

l0n3star commented Oct 9, 2019 via email
