pdfreader and bamboohr paycheck importer #94

a0js · 2024-03-13T08:09:54Z

Purpose

Many paycheck downloads are only available as pdf. To support this use case, and potentially others, we need to add support for pdf parsing. As a proof of concept, I also added a BambooHR paystub importer.

Approach

I mirrored the csv_multiple_reader.py as best I could and ended up primarily writing out the read_file method.
I added a debug option to print out some helpful hints when using the pdfreader.
For the BambooHR importer, I mostly copied the workday importer and added the few extra settings needed for pdf parsing.

Outstanding Issues

Tests
- I can't really upload my own paystub as a test. I've tried asking BambooHR for a demo paystub pdf, but they just redirected me to their website, which has a png version of the file. I have a friend who works at BambooHR, so I'm reaching out to see if he can send me a demo pdf that I could upload. That said, I do have it working on my machine (I know that really isn't good enough, but it is something). I could also write another importer for adp, and may be able to get a sample pdf for that.
Dependencies
- I mostly developed this in my own beancount setup, so I'm not sure how to add the necessary dependencies (pandas, pdfplumber, & dateparser). Could you help me figure out how to add those dependencies to the repo, if this is something you want to move forward with?

Resolves #93

redstreet · 2024-03-15T06:24:01Z

Looks great, and I'd be happy to take this in. Thanks for the helpful comments and the debug option too.

I completely understand the challenge in adding any kind of examples at all to importers, since by definition they're all personal data. But I'm wondering whether you could add an example pdf, even one you made up in a word processing program, that is importable by this? I just want to ensure we have some way down the line to test the code and confirm it's still alive.

It doesn't have to use the bamboohr importer (though that'd be great too). Perhaps a simple single table made in MS-Word or the likes, and a simple dummy importer for it?

redstreet · 2024-03-15T06:25:16Z

For dependencies, the following should work:

pip3 install pigar
pigar generate -c '>='

ranebrown · 2024-03-28T12:58:59Z

beancount_reds_importers/libreader/pdfreader.py

+
+        self.alltables = {}
+        for table in tables:
+            self.alltables[table['section']] = etl.fromdataframe(pd.DataFrame(table['table'][1:], columns=table['table'][0]))


I gave this a try and the table extraction worked for a PDF with 10 separate tables. If you want to avoid the dependency on pandas this could be changed to something like this:

        for table in tables:
            t = table["table"]
            # transpose table to use fromcolumns
            tbl_t = [[t[j][i] for j in range(len(t))] for i in range(len(t[0]))]
            self.alltables[table["section"]] = etl.fromcolumns(tbl_t)

If we can have it work without pandas, that would be great. I'll implement your suggestion and try to get a test case up before the end of the week.

Turns out we can just run etl.wrap(table['table']) rather than preprocessing it.
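
As a rough illustration of the table shapes discussed above, here is a minimal sketch in plain Python (no petl or pdfplumber required). It assumes an extracted table is a list of rows whose first row is the header, which is the shape the table['table'][0] and table['table'][1:] slicing in the diff implies; the sample data and helper names are made up. It shows that the nested-comprehension transpose matches zip(*rows), and that the rows-with-header form already pairs every value with its column name, which is presumably why it can be handed to etl.wrap without preprocessing.

```python
# Made-up sample in the assumed shape: a list of rows, header row first.
rows = [
    ["Description", "Hours", "Amount"],
    ["Regular", "80", "4000.00"],
    ["Bonus", "", "500.00"],
]


def transpose(t):
    """Turn a list of rows into a list of columns (same logic as the review suggestion)."""
    return [[t[j][i] for j in range(len(t))] for i in range(len(t[0]))]


def as_records(t):
    """Pair each data row with the header row, without any transposing."""
    header, *data = t
    return [dict(zip(header, row)) for row in data]


columns = transpose(rows)
records = as_records(rows)

# The comprehension is equivalent to the more idiomatic zip(*rows).
assert columns == [list(col) for col in zip(*rows)]

print(columns[0])  # first column, header included
print(records[0])  # first data row keyed by header
```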

ranebrown · 2024-03-28T13:00:42Z

I think the dependencies would just need added here

a0js · 2024-03-28T19:50:43Z

Apologies for not getting back to this. I'm hoping to make the changes and add a test before the end of the week.

a0js · 2024-03-30T23:36:32Z

requirements.txt

I'm not sure why pigar changed the file so much. It appears to still work, though.

It's moved all versions forward based on what is installed on your system. That's fine, I wouldn't worry about it much. requirements.txt is used only to set up development environments, including by the github action to build the beancount_reds_importers package.

The package's dependencies for users who install it (i.e., for regular users) are in setup.py which you will have to update manually. See install_requires under it.
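
To make that split concrete, here is a hedged sketch of how the new runtime dependencies might be merged into the list passed as install_requires. The new package names come from this thread; the "existing" entries, the helper name, and the absence of version pins are assumptions for illustration, and the real setup.py may look different. pandas is left out on the assumption that the etl.wrap change makes it unnecessary.

```python
# Sketch only: names come from this thread; versions are deliberately unpinned.
# Assumption: pandas is no longer needed once tables are passed straight to etl.wrap.
PDF_READER_REQUIRES = [
    "pdfplumber",  # table extraction from pdfs
    "dateparser",  # listed as a dependency in the PR description
    "petl",        # imported as `etl` in pdfreader.py
]


def merged_requirements(existing, extra):
    """Append new requirements to an existing list, skipping duplicates."""
    merged = list(existing)
    for name in extra:
        if name not in merged:
            merged.append(name)
    return merged


# Hypothetical existing entries; the real list lives under install_requires in setup.py.
install_requires = merged_requirements(["beancount", "petl"], PDF_READER_REQUIRES)
print(install_requires)

# In setup.py the list would then be passed along as:
#   setup(..., install_requires=install_requires, ...)
```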

I think the build is failing because pigar removed some dependencies. I'll look into it tomorrow and see what I can do.

a0js · 2024-03-30T23:38:26Z

I added a generic pdf paycheck importer that really isn't meant to be used but to show how to make a paycheck importer using the pdfreader. Tests are included and passing.

beancount_reds_importers/importers/genericpdf/tests/genericpdf_test.py

redstreet · 2024-03-31T01:32:31Z

This looks great, thank you, and thanks for the test! It looks like the package build fails, though. If you could please take a look and fix that, we can get this merged in.

a0js · 2024-03-31T05:46:02Z

I think the dependencies would just need added here

I forgot to do this. I'll update it tomorrow!

a0js · 2024-04-09T18:15:31Z

I wasn't able to get to this before I went off grid for a week. I'll try and update in the next few days.

I'm not sure what happened, but pigar would not detect imports from other tests. So I manually updated the requirements.txt to include all needed files.

a0js · 2024-04-14T17:38:01Z

@redstreet Not sure why, but I could not get pigar to find all the dependencies necessary, so I just manually updated the requirements.txt file. That should fix the tests.

redstreet · 2024-04-15T17:40:11Z

Thanks, no idea why pigar fails, but updating requirements.txt is fine.

The formatting is still failing, see above. If you could fix this, we can get this in. https://github.com/redstreet/beancount_reds_importers/blob/main/CONTRIBUTING.md shows what to run to fix formatting.

a0js · 2024-04-16T16:53:45Z

Not sure why, but when I ran the formatting commands locally I got 64 changes instead of 3 as observed in the github action output. I'm working on a mac and for some reason the default ruff settings are different. I ended up running a docker image that copied the github action environment and ran the format commands inside docker and that got the correct formatting changes.

Thanks for your patience, by the way.

redstreet · 2024-04-16T16:54:03Z

@a0js, curious, is there a button to run the formatting workflow that appears for you in this PR? I made a commit yesterday to enable this and am wondering if it works.

redstreet · 2024-04-16T17:07:27Z

Not sure why, but when I ran the formatting commands locally I got 64 changes instead of 3 as observed in the github action output. I'm working on a mac and for some reason the default ruff settings are different. I ended up running a docker image that copied the github action environment and ran the format commands inside docker and that got the correct formatting changes.

Thanks for reporting. Hmm, not sure why you're seeing this. Ruff uses settings from the pyproject.toml that's in the repo. So if you're running from the repo root and that file is present, the settings should come from there. Perhaps try a ruff --version, and a pip install ruff --upgrade if needed?
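
One way to check which configuration file applies is to walk up from the working directory looking for a pyproject.toml that contains a [tool.ruff] table. The sketch below is a simplified approximation of that lookup, not Ruff's actual implementation (Ruff also recognises ruff.toml and .ruff.toml, which this ignores), and the function name is made up.

```python
from pathlib import Path


def find_ruff_pyproject(start):
    """Return the nearest pyproject.toml at or above `start` that has a [tool.ruff...] table, else None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / "pyproject.toml"
        # A plain substring check keeps the sketch dependency-free;
        # it also matches sub-tables such as [tool.ruff.lint].
        if candidate.is_file() and "[tool.ruff" in candidate.read_text(encoding="utf-8"):
            return candidate
    return None


if __name__ == "__main__":
    found = find_ruff_pyproject(Path.cwd())
    print(found if found else "no pyproject.toml with a [tool.ruff] section found")
```

Running this from the directory where the format commands are executed shows whether the repo's pyproject.toml is the one that would be found first; if it prints a different path or nothing, the commands are probably not being run from inside the repo.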

Thanks for your patience, by the way.

Of course, no worries at all, thank you for sticking with this PR and getting it in, much appreciated!

a0js · 2024-04-18T00:04:55Z

@a0js, curious, is there a button to run the formatting workflow that appears for you in this PR? I made a commit yesterday to enable this and am wondering if it works.

I don't see anything on the PR page, but I could be looking in the wrong place. Where should it be?

redstreet · 2024-04-18T00:24:34Z

I don't see anything on the PR page, but I could be looking in the wrong place. Where should it be?

I'd expect it to be on the bottom of the PR page. I'm sure it'd be an obvious green button, so if you haven't seen it, it must've not worked. Anyway, it doesn't matter much for this PR, thanks for checking!

Do let me know if you need help with any of the outstanding things.

a0js · 2024-04-19T16:55:48Z

I can't merge this myself as I don't have write access. If you think this is good to go, can you merge it in?

redstreet · 2024-04-19T18:43:17Z

I think it was failing checks, but the checks are not running now for some reason. Let me take a look.

redstreet · 2024-04-19T18:48:37Z

Checks are passing for me locally. I don't know why github wouldn't run them. Either way, merged!

Thank you again for the contribution, and for working to get this PR through! IMO, table extraction from pdfs is a solid contribution for beancount_reds_importers as it's still fairly common to find that pdfs are the only option (no csv/ofx). So this is great!

a0js · 2024-04-20T18:56:59Z

You're most welcome! I'm glad I could add something to this awesome project. Let me know if there are some other features I can help with.

a0js · 2024-05-14T02:51:53Z

Sorry to comment on the old PR, but I was just curious how often you cut releases and when this one might be published?

redstreet · 2024-05-14T03:59:06Z

Np at all. I usually put it through personal use of at least a few weeks before I publish, so bugs have a chance to surface. Let me take a look this evening to see if I can cut a release.

redstreet · 2024-05-14T08:24:23Z

Released 0.9.0, featuring this PR :-)

a0js added 2 commits March 13, 2024 01:53

feat: add pdfreader libreader importer

5599e6b

feat: add bamboohr paycheck importer

041e006

ranebrown reviewed Mar 28, 2024

feat: add genericpdf paycheck importer

b93a854

a0js commented Mar 30, 2024

a0js requested a review from ranebrown March 30, 2024 23:38

a0js commented Mar 31, 2024

beancount_reds_importers/importers/genericpdf/tests/genericpdf_test.py

a0js added 2 commits April 14, 2024 11:33

fix: update requirements to add back lost packages

4469095

I'm not sure what happened, but pigar would not detect imports from other tests. So I manually updated the requirements.txt to include all needed files.

Merge branch 'main' into pdfreader-and-bamboohr-importer

f504b84

a0js requested a review from redstreet April 15, 2024 16:26

redstreet and others added 2 commits April 16, 2024 10:48

ci: enable workflows to run automatically on PRs

6432b20

chore: formatting

ebbcfeb

redstreet approved these changes Apr 19, 2024

redstreet merged commit 230f755 into redstreet:main Apr 19, 2024

a0js deleted the pdfreader-and-bamboohr-importer branch April 24, 2024 17:28
