Proof of concept: parsing PDF tree felling permits #590

Open: electricmonk wants to merge 3 commits into master

electricmonk commented Nov 28, 2022

Hi
Following #587, here's a proof of concept that reads from the Tel Aviv city hall website, downloads PDFs and then attempts to parse them.

Currently the results are so-so: only about 90% of the PDFs contain text; the rest are scanned images, which would require integrating OCR. Even the text PDFs are not parsed deterministically, and extracting reliable data from them would take more work. Currently about 70% of the data is parsed, but I'm scoring all fields equally.
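To make "scoring all fields equally" concrete, here is a minimal sketch of that kind of metric. It is not the code in this PR: the field names in `FIELDS` and the helpers `parse_score` and `mean_score` are hypothetical, chosen only for illustration.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical field names; the real permit schema may differ.
FIELDS = ("address", "tree_species", "tree_count", "reason", "permit_date")


def parse_score(record: Mapping[str, Optional[object]],
                fields: Sequence[str] = FIELDS) -> float:
    """Fraction of expected fields that were extracted, each weighted equally."""
    filled = 0
    for name in fields:
        value = record.get(name)
        # Treat a missing key, None, or an empty string as "not parsed".
        if value is not None and value != "":
            filled += 1
    return filled / len(fields)


def mean_score(records: Sequence[Mapping[str, Optional[object]]]) -> float:
    """Average per-record score over a batch; 0.0 for an empty batch."""
    if not records:
        return 0.0
    return sum(parse_score(r) for r in records) / len(records)
```

With equal weights, a record missing its address counts the same as one missing a less important field, which is why a weighted variant might describe parse quality better.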

The important question before I move forward is: is this better than nothing? Should I invest more time?


CLAassistant commented Nov 28, 2022

CLA assistant check
All committers have signed the CLA.

electricmonk changed the title from "Proof of concept: parsing PDF tree licenses" to "Proof of concept: parsing PDF tree permits" Nov 28, 2022
electricmonk changed the title from "Proof of concept: parsing PDF tree permits" to "Proof of concept: parsing PDF tree felling permits" Nov 28, 2022
gruppin (Collaborator) commented Nov 28, 2022

Hi, unfortunately this is not better than nothing.
Our commitment to our users is to cover all the tree licenses, not just some of them; they rely on us to notify them about every tree license.
In that sense, notifying about a partial set of licenses is even worse than not notifying at all, since we might mislead our users with a false picture of reality.
I don't think you should invest more time in it.

electricmonk (Author) commented

That said, we can extract 100% of street addresses from the PDF file names, and we could conceivably build a unique ID from the street address, city, and publication date. So we wouldn't miss any petition; we'd just have missing data for 10%-30% of them. Wdyt?
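A rough sketch of what such an ID could look like, assuming the three values are already available as a street address string, a city string, and a date. The function name `permit_id`, the normalization rules, and the 16-character truncation are assumptions for illustration, not code from this PR.

```python
import hashlib
import unicodedata
from datetime import date


def _normalize(text: str) -> str:
    """Normalize Unicode form, collapse whitespace, and ignore letter case."""
    return " ".join(unicodedata.normalize("NFKC", text).split()).casefold()


def permit_id(street_address: str, city: str, published: date) -> str:
    """Derive a stable ID from street address, city, and publication date.

    Hypothetical helper: the same inputs always give the same ID, so
    re-downloading a PDF would not create a duplicate entry.
    """
    key = "|".join([
        _normalize(street_address),
        _normalize(city),
        published.isoformat(),
    ])
    # Truncated SHA-256 hex digest; the length of 16 is an arbitrary choice.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Note that two permits published on the same day for the same address would collide under this scheme, so it only works if that combination really is unique.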
