Unable to parse text from a valid pdf #363

carlosrafp · 2024-09-09T11:53:13Z

Extracting the text from the following PDF produces the wrong characters.

Fragment extracted from the 1st page:
䴀䄀吀䔀刀䤀䄀䰀 ⴀ 匀䄀一䜀唀䔀
䠀䔀䴀伀䜀刀䄀䴀䄀

Code used to extract text:
async function extractPageTextsFromPdf(pdfBuffer: Buffer): Promise<string[]> {
  const pdfParser = new PDFParser(null, true, '');

  // Join every text run on a page into one string, then undo the URI encoding.
  function decodePdfPageTexts(texts: Text[]) {
    return decodeURIComponent(
      texts.map((t) => t.R.map((tt) => tt.T).join(' ')).join(' ')
    );
  }

  const texts: Promise<string[]> = new Promise((resolve, reject) => {
    pdfParser.on('pdfParser_dataReady', (pdfData) => {
      const { Pages: pages } = pdfData;
      const tx = pages.map((p) => p.Texts).map(decodePdfPageTexts);
      resolve(tx);
    });
    pdfParser.on('pdfParser_dataError', (errData) => {
      reject(errData.parserError);
    });
    pdfParser.parseBuffer(pdfBuffer);
  });

  return texts;
}


modesty · 2024-12-30T01:16:02Z

The text's font, T3Font_0, is a Type 3 font with a custom encoding, which is not supported. This is the same problem as issue #377.
Two options to move forward:

submit PR to support type 3 font rendering in canvas.js
recreate the PDF with standard TrueType font and standard encoding
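
As a stopgap for output like the fragment quoted in the report, note that each garbled character there appears to be a UTF-16 code unit with its two bytes swapped: U+4D00 (䴀) becomes U+004D ("M"), U+4100 (䄀) becomes U+0041 ("A"), and so on. The sketch below is only an observation about that sample, not a fix for Type 3 font support. `swapUtf16Bytes` is a hypothetical helper and is not part of pdf2json.

```typescript
// Hypothetical helper (not part of pdf2json): swaps the two bytes of every
// UTF-16 code unit whose low byte is zero, e.g. U+4D00 -> U+004D ("M").
// Assumption: the whole string is affected, as in the fragment from the report.
function swapUtf16Bytes(garbled: string): string {
  let out = '';
  for (let i = 0; i < garbled.length; i++) {
    const unit = garbled.charCodeAt(i);
    // Code units such as 0x4D00 have a non-zero high byte and a zero low byte.
    const looksSwapped = unit > 0xff && (unit & 0xff) === 0;
    out += looksSwapped ? String.fromCharCode(unit >> 8) : garbled[i];
  }
  return out;
}

// The second line of the reported fragment, written as escapes so the
// code points are unambiguous.
const sample = '\u4800\u4500\u4D00\u4F00\u4700\u5200\u4100\u4D00\u4100';
console.log(swapUtf16Bytes(sample)); // prints "HEMOGRAMA"
```

The check is deliberately crude: it would also rewrite legitimate characters whose code point ends in a zero byte (for example 一, U+4E00), so it should only be applied to output already known to be affected.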

modesty added not-supported help-wanted: type3-font labels Dec 30, 2024

modesty mentioned this issue Dec 30, 2024

Bad parsing of PDF file #377

Open
