Not able to extract Textract merge cell text properly #72

sravzmum · 2022-06-30T18:15:33Z

I am not able to extract the text of merged cells properly. There seems to be an issue with the combine_headers function, and Textract does not extract the top header text properly.

Reference:
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))
Now let’s iterate through the tables’ content, and extract the data into a DataFrame:

table_index = 1
dataframes = []
def combine_headers(top_h, bottom_h):
    bottom_h[3] = top_h[2] + " " + bottom_h[3]
    bottom_h[4] = top_h[2] + " " + bottom_h[4]
for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names() #New Table method to retrieve header column names
        if(len(headers)>0): #Let's retain the only table with headers
            print("Statememt headers: "+ repr(headers))
            top_header= headers[0]
            bottom_header = headers[1]
            combine_headers(top_header, bottom_header) #The statement has two headers. let's combine them
            for r, row in enumerate(table.rows_without_header): #New Table attribute returning rows without headers
                table_data.append([])
                for c, cell in enumerate(row.cells):
                    table_data[r].append(cell.mergedText) #New Cell attribute returning merged cells common values
            if len(table_data)>0:
                df = pd.DataFrame(table_data, columns=bottom_header)

Document table format:

with above logic:

With small changes in the combine header, my issue got solved to some extent:

def combine_headers(top_h, bottom_h):
    # Prefix each bottom header with its top header, in place,
    # unless both rows already hold the same text for that column.
    for i in range(len(top_h)):
        if bottom_h[i] != top_h[i]:
            bottom_h[i] = top_h[i] + ' ' + bottom_h[i]

But there is still some issue with Textract's top header detection.
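A more defensive variant of the function above could look like the sketch below. It is only an illustration, not part of the TRP library: the name combine_headers_safe is made up here, and it assumes that a blank top-header cell belongs to the merged cell on its left, which may not hold for every document.

```python
def combine_headers_safe(top_h, bottom_h):
    """Return new column names built from a top and a bottom header row.

    Hypothetical helper for illustration; it does not modify its inputs.
    """
    combined = []
    last_top = ""
    for i, bottom in enumerate(bottom_h):
        # The top row may be shorter than the bottom row.
        top = top_h[i].strip() if i < len(top_h) else ""
        # Assumption: a blank top cell is covered by the merged cell to its left.
        if top:
            last_top = top
        else:
            top = last_top
        bottom = bottom.strip()
        if not top or top == bottom:
            combined.append(bottom)      # vertically merged or no top header
        elif not bottom:
            combined.append(top)         # only the top header carries text
        else:
            combined.append(f"{top} {bottom}")
    return combined


columns = combine_headers_safe(
    ["Date", "Amount", "", "Balance"],
    ["Date", "Debit", "Credit", "Balance"],
)
print(columns)
```

Because it returns a new list instead of editing bottom_h in place, the result can be passed directly as the columns argument when building the DataFrame.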


schadem · 2022-07-11T17:50:14Z

We should add an option to pass in a function that can be used instead of the fixed logic.
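As a rough sketch of what such an option might look like (the names build_columns, default_combiner and the combiner parameter are hypothetical and not part of the current TRP API), the header rows could be folded together with a caller-supplied function:

```python
from typing import Callable, List, Optional

# A combiner takes the upper and lower header rows and returns the merged names.
HeaderCombiner = Callable[[List[str], List[str]], List[str]]


def default_combiner(top_h: List[str], bottom_h: List[str]) -> List[str]:
    """Prefix each bottom header with its top header unless they are identical."""
    return [b if t == b else f"{t} {b}".strip() for t, b in zip(top_h, bottom_h)]


def build_columns(headers: List[List[str]],
                  combiner: Optional[HeaderCombiner] = None) -> List[str]:
    """Fold one or more header rows into a single list of column names.

    Hypothetical helper: `headers` is assumed to be a list of header rows,
    each a list of strings, top row first.
    """
    if not headers:
        return []
    combine = combiner or default_combiner
    columns = list(headers[0])
    for lower in headers[1:]:
        columns = combine(columns, list(lower))
    return columns


header_rows = [["Date", "Amount", "Amount"], ["Date", "Debit", "Credit"]]
print(build_columns(header_rows))
# A caller can swap in their own logic without touching the library code.
print(build_columns(header_rows, combiner=lambda t, b: [f"{x}/{y}" for x, y in zip(t, b)]))
```

The resulting list could then be used as the columns argument of pd.DataFrame, as in the snippet from the original report.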

tb102122 · 2022-07-11T23:21:45Z

@sravzmum are you able to provide a sample document?
I agree with the option. At some stage we could even extend it to accept custom functions for processing.

prasum · 2022-09-07T17:03:02Z

I am also getting the above issue while merging the top and bottom headers. Some part of the column name in the top header is getting missed in some scenarios. Request your guidance on the same.

tb102122 · 2022-09-07T20:32:07Z

@prasum Do you have a sample document you can share? Do you get the correct results from Textract in the OCR step?

prasum · 2022-09-09T17:24:14Z

Sorry for the late reply. I am not able to share the sample document due to internal restrictions. The scenario is the same as in the document shared above by @sravzmum. Yes, I am able to get the correct results from Textract in the OCR step.

tb102122 · 2022-09-12T00:04:09Z

We would need some sort of example, otherwise we can't help.

mukul-llmate · 2024-05-14T12:57:16Z

When I toggle "merge cells" on in the AWS Textract console, I get a perfect table, but when I download the result, or call the API and parse the response, the cells are unmerged.
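One thing that may be worth checking is whether the raw response contains MERGED_CELL blocks at all. The sketch below assumes the response follows the documented Textract block layout, where a MERGED_CELL block lists its CELL blocks as CHILD relationships and each CELL lists its WORD blocks the same way. The function name merged_cell_texts and the sample data are made up for illustration.

```python
def merged_cell_texts(blocks):
    """Collect position, span and text for every MERGED_CELL block.

    `blocks` is assumed to be the "Blocks" list of a Textract response.
    """
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        ids = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell):
        children = [by_id[i] for i in child_ids(cell) if i in by_id]
        return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")

    result = []
    for block in blocks:
        if block["BlockType"] != "MERGED_CELL":
            continue
        # Join the text of every underlying cell, skipping the empty ones.
        parts = [cell_text(by_id[i]) for i in child_ids(block) if i in by_id]
        result.append({
            "row": block["RowIndex"],
            "col": block["ColumnIndex"],
            "row_span": block["RowSpan"],
            "col_span": block["ColumnSpan"],
            "text": " ".join(p for p in parts if p),
        })
    return result


# Hand-written sample in the assumed shape, not real Textract output.
sample_blocks = [
    {"Id": "m1", "BlockType": "MERGED_CELL", "RowIndex": 1, "ColumnIndex": 2,
     "RowSpan": 1, "ColumnSpan": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 3},
    {"Id": "w1", "BlockType": "WORD", "Text": "Amount"},
]
print(merged_cell_texts(sample_blocks))
```

If this returns an empty list for a real response, the merge information is not in the JSON being parsed; if it returns entries, the span values can be used to copy the text into every column the merged cell covers.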

schadem closed this as completed Jul 11, 2022

schadem reopened this Jul 11, 2022

athewsey added the python label (Relates to the Python version of TRP) Aug 28, 2023
