Bad parsing of PDF file #377

Open
gladykov opened this issue Dec 27, 2024 · 1 comment

gladykov (Contributor) commented Dec 27, 2024

html.pdf

Title of this PDF:

Sodalitas delectus ipsum aperio facere.

is extracted as

4PEBMJUBTEFMFDUVTJQTVNBQFSJPGBDFSF

This PDF was exported from Confluence by Atlassian.

pdfinfo output:

Title:           Sodalitas delectus ipsum aperio facere. - test-automation - Confluence
Creator:         Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/131.0.0.0 Safari/537.36
Producer:        Skia/PDF m131
CreationDate:    Fri Dec 27 07:20:30 2024 -03
ModDate:         Fri Dec 27 07:20:30 2024 -03
Custom Metadata: no
Metadata Stream: no
Tagged:          yes
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           1
Encrypted:       no
Page size:       612 x 792 pts (letter)
Page rot:        0
File size:       15981 bytes
Optimized:       no
PDF version:     1.4

Method used:

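import PDFParser from 'pdf2json';

export async function parsePDF(filepath: string) {
    // https://github.com/modesty/pdf2json
    let parsed = false;

    // No context object is needed here; the second argument enables raw text
    // mode, which getRawTextContent() below relies on.
    /* eslint-disable-next-line */
    const pdfParser = new PDFParser(null, true);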

    /* eslint-disable-next-line */
    pdfParser.on('pdfParser_dataError', (errData) => console.error(errData.parserError));
    /* eslint-disable-next-line */
    pdfParser.on('pdfParser_dataReady', (_) => {
        parsed = true;
    });

    /* eslint-disable-next-line */
    await pdfParser.loadPDF(filepath);

    let i = 0;
    const max = 5;
    while (!parsed && i < max) {
        await sleep(1, 'Waiting for parsed PDF');
        i += 1;
    }

    if (i === max && !parsed) {
        throw new Error('Timeout while waiting for parsed PDF');
    }

    /* eslint-disable-next-line */
    return unixifyLineEndings(pdfParser.getRawTextContent());
}
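
The flag-and-sleep loop above can also be written as a single promise that settles on whichever event arrives first. The following is only a sketch of that pattern, not pdf2json's API: `waitForEvent` is a hypothetical helper, it assumes the parser behaves like a Node.js `EventEmitter` (as the `.on(...)` calls above suggest), and a plain `EventEmitter` stands in for the parser so the snippet runs on its own.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical helper: resolve with the payload of `okEvent`,
// reject on `errEvent` or when `timeoutMs` elapses, whichever comes first.
function waitForEvent<T>(
    emitter: EventEmitter,
    okEvent: string,
    errEvent: string,
    timeoutMs: number,
): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        // Remove both listeners and the timer once the promise settles.
        function cleanup(): void {
            clearTimeout(timer);
            emitter.off(okEvent, onOk);
            emitter.off(errEvent, onErr);
        }
        const onOk = (data: T): void => { cleanup(); resolve(data); };
        const onErr = (err: unknown): void => { cleanup(); reject(err); };
        const timer = setTimeout(() => {
            cleanup();
            reject(new Error(`Timeout while waiting for ${okEvent}`));
        }, timeoutMs);

        emitter.once(okEvent, onOk);
        emitter.once(errEvent, onErr);
    });
}

// Stand-in for the parser: a bare EventEmitter using the event names from the code above.
const fakeParser = new EventEmitter();
const pending = waitForEvent<string>(
    fakeParser, 'pdfParser_dataReady', 'pdfParser_dataError', 5000);
fakeParser.emit('pdfParser_dataReady', 'parsed');
pending.then((value) => console.log(value)); // prints "parsed"
```

With a real parser, the listeners would need to be registered before `loadPDF` is called so that neither event can be missed. Unlike the original loop, a parse error rejects immediately instead of being logged and then reported as a timeout.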

modesty (Owner) commented Dec 30, 2024

The first line of text in the sample PDF uses a Type 3 font with a custom encoding, which pdf2json does not currently support. This is the same problem as issue #363.
There are two ways to move forward:

  1. Submit a PR that adds Type 3 font rendering support to canvas.js.
  2. Recreate the PDF with a standard TrueType font and standard encoding.
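
To illustrate what the custom encoding does in this particular sample: comparing the two titles quoted in the report, every extracted letter sits exactly 31 code points below the real one (`S` is 83, `4` is 52), while the spaces and the final period are missing from the output. The sketch below only demonstrates that observation. `shiftChars` is a hypothetical helper, the offset 31 is inferred from this single PDF, and none of this is part of pdf2json or a general fix for Type 3 fonts.

```typescript
// Hypothetical helper: shift every character code by a fixed offset.
function shiftChars(text: string, offset: number): string {
    return Array.from(text, (ch) =>
        String.fromCharCode(ch.charCodeAt(0) + offset)).join('');
}

// Garbled title exactly as quoted in the report.
const extracted = '4PEBMJUBTEFMFDUVTJQTVNBQFSJPGBDFSF';

// Offset 31 is an assumption derived from this one sample ('4' + 31 = 'S').
const recovered = shiftChars(extracted, 31);
console.log(recovered); // prints "Sodalitasdelectusipsumaperiofacere"
```

The letters come back, but the word boundaries do not, so this is useful only as a diagnostic that the text is present under a remapped encoding rather than lost.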
