Rows from table in PDF return as one unbroken string (no individual cells) #28

Open
dmeekerpcg opened this issue Aug 28, 2023 · 0 comments

With the old tabula-java, the cells are split out of the table, but this does not work in tabula-sharp: I just get a whole row/line as a single string, with no individual cells broken out. Could this be because the table is non-uniform (different column counts on different rows)?

Example table (I cannot attach the PDF because it contains personal information):
[image: TableExample]

I am using the latest version of PdfPig, but that didn't seem to help. See the example code below; maybe I'm doing something wrong with the syntax. I'm just trying to iterate through the rows.

using (PdfDocument document = PdfDocument.Open(path, new ParsingOptions() { ClipPaths = false }))
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(Page);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;

    string result = "";
    string test = rows[0][0].GetText(); // <---- testing first cell
    Run.PrintLog("Test: " + test);

    foreach (var r in rows)
    {
        foreach (RectangularTextContainer txt in r)
        {
            result += txt.GetText() + "|"; // <---- for each cell (?)
        }
        result += System.Environment.NewLine;
    }
    Run.PrintLog("Tab result: " + result);
}