Parts of the merged cells text is getting cut off when merged #129

GradyMellin · 2023-03-17T19:17:13Z

When I merge cells whose text spans multiple cells, across both rows and columns, only the text from the first cell is transferred. I assume I have to do something like the combine headers function, but I am having trouble finding out how to access the other cells. I have added a picture of a table similar to the one that is giving me problems, as well as my code and results. Any help with this would be greatly appreciated!

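import pandas as pd
from textractcaller.t_call import call_textract, Textract_Features
from trp import Document
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo

textract_json = call_textract(input_document=documentName, features=[Textract_Features.TABLES])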

t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))

table_index = 1
dataframes = []

def combine_headers(top_h, mid_h, bottom_h):
    try:
        bottom_h[4] = top_h[4] + " " + mid_h[4] + " " + bottom_h[4]
        bottom_h[5] = top_h[4] + " " + mid_h[4] + " " + bottom_h[5]
    except:
        pass

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()
        if len(headers) > 0:
            print("Statement headers: " + repr(headers))
            top_header= headers[0]
            middle_header = headers[1]
            bottom_header = headers[2]
            combine_headers(top_header, middle_header, bottom_header)   
            for r, row in enumerate(table.rows_without_header): 
                table_data.append([])
                for c, cell in enumerate(row.cells):
                    table_data[r].append(cell.mergedText)  
            
            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)
                # Print inside the guard so df is always defined when it is used.
                print(df.to_markdown())
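
The `combine_headers` function above hard-codes columns 4 and 5. A more general version can be sketched as follows. This is only an illustration, assuming the header rows come back as equal-length lists of strings, as `headers[0]`, `headers[1]` and `headers[2]` suggest. The function name `combine_header_rows` and the sample rows are hypothetical, not part of the TRP library or of the real table.

```python
from typing import List, Sequence


def combine_header_rows(header_rows: Sequence[Sequence[str]]) -> List[str]:
    """Join stacked header rows column by column into one label per column."""
    if not header_rows:
        return []
    width = max(len(row) for row in header_rows)
    combined = []
    for col in range(width):
        parts: List[str] = []
        for row in header_rows:
            text = row[col].strip() if col < len(row) and row[col] else ""
            # Skip blanks and immediate repeats, which merged header cells tend to produce.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        combined.append(" ".join(parts))
    return combined


# Hypothetical header rows, loosely modelled on the table in this issue.
example_rows = [
    ["", "Distribution", "Distribution"],
    ["", "Local", "Local"],
    ["Codes", "Mark Up Factor", "Cost Factor"],
]
print(combine_header_rows(example_rows))
```

Unlike the original function, this returns a new list instead of modifying the bottom row in place, so the result would be passed to `pd.DataFrame(..., columns=...)` directly.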

Table:

As the results below show, the header text after "Local (Up" is cut off because it runs into the next cell. The same happens in all of the length class rows, which lose the "pages)" part, and in the Extra-Long books row.
Results:

	Length Class	Category Class	Codes	Codes	Distribution Local (Up To Mark Up Factor	Distribution Local (Up To Cost Factor
0	Short Books (0 100	Children's	Non-fiction Fiction	011-- 012--	1.10	1.00
1	Short Books (0 100	Mystery	Non-fiction Fiction	021-- 022--	1.55	1.15
2	Short Books (0 100	Romance	Non-fiction Fiction	031-- 032--	1.40	1.00
3
4	Medium Books (101 500	Children's	Non-fiction Fiction	211-- 212--	1.05	0.95
5	Medium Books (101 500	Mystery	Non-fiction Fiction	221-- 222--	1.50	0.70
6	Medium Books (101 500	Romance	Non-fiction Fiction	231-- 232--	1.40	0.75
7
8	Long Books (501 - 1,000	Children's	Non-fiction Fiction	311-- 312--	1.10	0.65
9	Long Books (501 - 1,000	Mystery	Non-fiction Fiction	321-- 322--	1.55	0.90
10	Long Books (501 - 1,000	Romance	Non-fiction Fiction	331-- 332--	1.25	0.70
11
12	Extra-Long (Over 1,000	Extra-Long (Over 1,000	Non-fiction Fiction	401-- 402--	2.45	1.15
13


pranavbhat12 · 2024-01-08T11:21:36Z

Hey @GradyMellin, I am also facing the same issue. Did you find a workaround for this?
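
One possible workaround, sketched here under assumptions and not confirmed by the maintainers, is to bypass `mergedText` and rebuild the text from the raw Textract response. In that JSON, each `MERGED_CELL` block lists its child `CELL` blocks under a `CHILD` relationship, and each `CELL` lists its `WORD` blocks the same way. The helper names below (`child_ids`, `cell_text`, `merged_cell_texts`) and the sample blocks are hypothetical and written for illustration only.

```python
from typing import Dict, List


def child_ids(block: dict, rel_type: str = "CHILD") -> List[str]:
    """Return the IDs listed under one relationship type of a Textract block."""
    ids: List[str] = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == rel_type:
            ids.extend(rel.get("Ids", []))
    return ids


def cell_text(cell: dict, block_map: Dict[str, dict]) -> str:
    """Join the WORD children of a single CELL block."""
    words = []
    for cid in child_ids(cell):
        child = block_map.get(cid, {})
        if child.get("BlockType") == "WORD":
            words.append(child.get("Text", ""))
    return " ".join(w for w in words if w)


def merged_cell_texts(blocks: List[dict]) -> Dict[str, str]:
    """Map each MERGED_CELL block ID to the text of all of its child cells."""
    block_map = {b["Id"]: b for b in blocks}
    result: Dict[str, str] = {}
    for block in blocks:
        if block.get("BlockType") != "MERGED_CELL":
            continue
        cells = [block_map[cid] for cid in child_ids(block) if cid in block_map]
        # Read child cells top to bottom, then left to right.
        cells.sort(key=lambda c: (c.get("RowIndex", 0), c.get("ColumnIndex", 0)))
        parts = [cell_text(c, block_map) for c in cells]
        result[block["Id"]] = " ".join(p for p in parts if p)
    return result


# Hypothetical blocks: one merged cell whose text is split over two child cells.
sample_blocks = [
    {"Id": "m1", "BlockType": "MERGED_CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3", "w4"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Short"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Books"},
    {"Id": "w3", "BlockType": "WORD", "Text": "(0"},
    {"Id": "w4", "BlockType": "WORD", "Text": "100"},
    {"Id": "w5", "BlockType": "WORD", "Text": "pages)"},
]
print(merged_cell_texts(sample_blocks))  # {'m1': 'Short Books (0 100 pages)'}
```

With the code from the original post, this would be called as `merged_cell_texts(textract_json["Blocks"])`, assuming the response has a top-level `Blocks` list. The returned texts still have to be mapped back onto the table rows, for example by the `RowIndex` and `ColumnIndex` of each child cell.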

athewsey added the "python" label (Relates to the Python version of TRP) on Aug 28, 2023
