feat: reading multiple pdf files with a single PDFParser object #371

Open
wants to merge 8 commits into
base: master

Conversation

nicolabaesso

Elements changed:

  1. Added a new test case in a separate file
  2. Added the two example PDFs
  3. Added a reset of the pages array when the data variable is null

I've added these elements because we use this library at my corporate job, and recreating the PDFParser object every time is not something I'm a fan of.
The other test cases are not failing, so there are no regressions.
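
To make the intended usage concrete, here is a minimal sketch of parsing several files one after another with a single parser instance. It assumes the parser is an EventEmitter exposing `loadPDF` and the `pdfParser_dataReady` / `pdfParser_dataError` events; `StubParser`, `parseOne`, and `parseAll` are hypothetical names used only for illustration, and the stub echoes the file name instead of parsing a real PDF.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical stand-in for a reusable parser. It emits the assumed event
// names, but "parsing" only echoes the file name back.
class StubParser extends EventEmitter {
  loadPDF(filePath) {
    // Emit asynchronously, as a real parser would after reading the file.
    setImmediate(() => this.emit("pdfParser_dataReady", { source: filePath }));
  }
}

// Parse one file and resolve with its data. Both listeners are detached as
// soon as either fires, so they do not pile up across files.
function parseOne(parser, filePath) {
  return new Promise((resolve, reject) => {
    const cleanup = () => {
      parser.off("pdfParser_dataReady", onReady);
      parser.off("pdfParser_dataError", onError);
    };
    const onReady = (data) => { cleanup(); resolve(data); };
    const onError = (err) => { cleanup(); reject(err); };
    parser.on("pdfParser_dataReady", onReady);
    parser.on("pdfParser_dataError", onError);
    parser.loadPDF(filePath);
  });
}

// Parse the files strictly in sequence on the same parser instance.
async function parseAll(parser, filePaths) {
  const results = [];
  for (const filePath of filePaths) {
    results.push(await parseOne(parser, filePath));
  }
  return results;
}
```

Running the files sequentially matters here: a single instance holds one document's state at a time, so a second `loadPDF` should only start after the first has finished.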

Owner

@modesty modesty left a comment


Thanks for adding more tests. A few thoughts on making the instance of PDFJSClass reusable:

  1. The pdfparser instance (or the client that instantiates PDFJSClass) needs to be reset/reusable whenever PDFParser is created (line 107 of pdfparser.js).
  2. lib/pdf.js: setting this.pages=[] is not sufficient to dispose of the object; pdfDocument and rawTextContents need to be reset too. I recommend calling the existing destroy method.
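
The second point can be illustrated with a small sketch. `DocumentState` is a hypothetical stand-in, not the actual class in lib/pdf.js; only the field names `pages`, `pdfDocument`, and `rawTextContents` come from the comment above. It shows how clearing `pages` alone leaves stale data from the previous document behind.

```javascript
// Hypothetical stand-in for the per-document state named in the review.
class DocumentState {
  constructor() {
    this.pages = [];
    this.pdfDocument = null;
    this.rawTextContents = [];
  }

  // Simulate loading a document: one page and one raw text entry per string.
  load(name, pageTexts) {
    this.pdfDocument = { name };
    for (const text of pageTexts) {
      this.pages.push({ text });
      this.rawTextContents.push(text);
    }
  }

  // Incomplete reset: rawTextContents and pdfDocument keep their old values.
  resetPagesOnly() {
    this.pages = [];
  }

  // Full reset: every per-document field goes back to its initial value.
  destroy() {
    this.pages = [];
    this.rawTextContents = [];
    this.pdfDocument = null;
  }
}
```

With `resetPagesOnly()`, loading a second document appends its raw text to the first document's entries; with `destroy()`, the second load starts from a clean state.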

@nicolabaesso
Author

Hi @modesty,
thank you for your review. As suggested, I removed the this.pages=[] line and called the already available destroy() function instead.
I've also added a function to reset the PDFJS object. Is this what you were referring to? If not, let me know.
I've also removed the if on line 120 of pdfparser.js; it was left over from a test I was doing to understand the code.

@nicolabaesso
Author

Hi @modesty,
sorry for the pressure, but could you give me feedback on the code? Over the next few days I could make some changes if something is still wrong (I feel like the function for resetting the PDFJS object could use some more work, but I would like to have your opinion).

Thank you!

@modesty
Owner

modesty commented Dec 30, 2024

sorry for the delay. code LGTM. two notes:

  1. could you add a command line option to enable it optionally? default false would keep current clients intact.
  2. could you merge master into this branch and run all tests when ready?
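
One possible reading of the first note, sketched under assumptions: an opt-in flag that defaults to `false`, so existing clients see no change unless they ask for it. The flag name `--reuse-parser` and the function `parseCliArgs` are hypothetical and are not part of the project's actual command line interface.

```javascript
// Hypothetical argument parsing: the flag name is an assumption.
// Anything that is not the flag is treated as an input file path.
function parseCliArgs(argv) {
  const options = { reuseParser: false, files: [] }; // opt-in, off by default
  for (const arg of argv) {
    if (arg === "--reuse-parser") {
      options.reuseParser = true;
    } else {
      options.files.push(arg);
    }
  }
  return options;
}

// Typical call site: skip the node binary and the script path.
const cliOptions = parseCliArgs(process.argv.slice(2));
```

When `reuseParser` is `false`, the caller would keep creating one parser per file as before; only when it is `true` would a single instance be shared across the listed files.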

@nicolabaesso
Author

> sorry for the delay. code LGTM. two notes:
>
> 1. could you add a command line option to enable it optionally? default `false` would keep current clients intact.
>
> 2. could you merge master into this branch and run all tests when ready?

  1. Sorry, I'm not getting it: how can we provide this feature when the library is used via the command line? Should it be a whole section where, if the flag is provided, you can pass two or more files and the library reads them with a single instance? Just asking.

  2. Master branch merged. I modified the reset part to call a method similar to destroy(), except that it does not remove the listeners. Calling destroy() itself made the tests hang forever; with this change the test cases pass and finish in less than 5 seconds.

Let me know what to do next, I'll do it as soon as I can.

Thank you!
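
A minimal sketch of why a reset that keeps the listeners behaves differently from a destroy that removes them. `ReusableSource` and the `ready` event are hypothetical, and the assumption that destroy() removes listeners is taken from the comment above, not verified against the library.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical emitter with two ways of clearing its state.
class ReusableSource extends EventEmitter {
  constructor() {
    super();
    this.pages = [];
  }

  // Clears state AND detaches every listener: nobody hears later events.
  destroy() {
    this.pages = [];
    this.removeAllListeners();
  }

  // Clears state only: existing listeners stay attached for the next file.
  reset() {
    this.pages = [];
  }

  // emit() returns true only if at least one listener was registered.
  finish() {
    return this.emit("ready", this.pages.length);
  }
}
```

If a caller is waiting on an event and the emitter drops its listeners in between, that event is never delivered, which is consistent with a test that never completes.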

2 participants