Not able to extract Textract merge cell text properly #72

Open
sravzmum opened this issue Jun 30, 2022 · 7 comments
Labels
python Relates to the Python version of TRP

Comments

@sravzmum

I am not able to extract merged-cell text properly. There seems to be an issue with the combine_headers function, and Textract does not extract the top header text properly.

Reference:
import pandas as pd
from trp import Document
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo

t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))
Now let’s iterate through the tables’ content, and extract the data into a DataFrame:

table_index = 1
dataframes = []

def combine_headers(top_h, bottom_h):
    bottom_h[3] = top_h[2] + " " + bottom_h[3]
    bottom_h[4] = top_h[2] + " " + bottom_h[4]

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()  # New Table method to retrieve header column names
        if len(headers) > 0:  # Let's retain only the tables with headers
            print("Statement headers: " + repr(headers))
            top_header = headers[0]
            bottom_header = headers[1]
            combine_headers(top_header, bottom_header)  # The statement has two headers. Let's combine them
            for r, row in enumerate(table.rows_without_header):  # New Table attribute returning rows without headers
                table_data.append([])
                for c, cell in enumerate(row.cells):
                    table_data[r].append(cell.mergedText)  # New Cell attribute returning merged cells common values
            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)

Document table format:
image

Result with the above logic:
image

With a small change to combine_headers, my issue was solved to some extent:

def combine_headers(top_h, bottom_h):
    # Prefix each bottom header with its top header, unless the two are identical
    for i in range(len(top_h)):
        if bottom_h[i] != top_h[i]:
            bottom_h[i] = top_h[i] + ' ' + bottom_h[i]

But there is still an issue with Textract's top header detection:
image
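
One way to work around missing top-header text can be sketched as follows. This is only a sketch under an assumption that has not been confirmed in this thread: that a merged top-header cell carries its text in the first column it spans and comes back blank for the remaining columns. The function name `combine_headers_ffill` and the sample header values are hypothetical and not part of TRP. The idea is to forward-fill blank top-header entries before prefixing, and to return a new list instead of mutating the input.

```python
from typing import List


def combine_headers_ffill(top_h: List[str], bottom_h: List[str]) -> List[str]:
    """Combine two header rows into one list of column names.

    Assumption: a merged top-header cell carries its text only in the
    first column it spans; the other spanned columns come back blank.
    """
    combined = []
    last_top = ""
    for i, bottom in enumerate(bottom_h):
        top = top_h[i].strip() if i < len(top_h) else ""
        if top:
            last_top = top       # start of a new (possibly merged) top header
        else:
            top = last_top       # blank entry: reuse the previous top header
        bottom = bottom.strip()
        if not bottom or bottom == top:
            combined.append(top)     # single-row or vertically merged header
        elif not top:
            combined.append(bottom)  # nothing to prefix with
        else:
            combined.append(f"{top} {bottom}")
    return combined


# Hypothetical header rows, for illustration only
top = ["Date", "Description", "Amount", "", "Balance"]
bottom = ["Date", "Description", "Debit", "Credit", "Balance"]
print(combine_headers_ffill(top, bottom))
```

Note that forward-filling would wrongly prefix a column whose top header is genuinely empty, so this only fits tables where every column sits under some top header.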

@schadem
Contributor

schadem commented Jul 11, 2022

We should add an option to pass in a function that can be used instead of the fixed logic.
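
A rough sketch of what such an option could look like, written as standalone Python rather than as a change to TRP itself. The names `resolve_columns`, `default_combiner`, and the `combiner` parameter are hypothetical; the only thing taken from this thread is that `get_header_field_names()` yields a list of header rows.

```python
from typing import Callable, List, Optional, Sequence

# A combiner takes the top and bottom header rows and returns column names.
HeaderCombiner = Callable[[List[str], List[str]], List[str]]


def default_combiner(top_h: List[str], bottom_h: List[str]) -> List[str]:
    """Prefix each bottom header with its top header when they differ."""
    return [b if t == b or not t else f"{t} {b}" for t, b in zip(top_h, bottom_h)]


def resolve_columns(
    headers: Sequence[List[str]],
    combiner: Optional[HeaderCombiner] = None,
) -> List[str]:
    """Collapse one or two header rows into a flat list of column names."""
    if not headers:
        return []
    if len(headers) == 1:
        return list(headers[0])
    combine = combiner or default_combiner
    # Pass copies so a custom combiner cannot mutate the caller's lists
    return combine(list(headers[0]), list(headers[1]))


# Hypothetical header rows, for illustration only
headers = [["Date", "Amount", "Amount"], ["Date", "Debit", "Credit"]]
print(resolve_columns(headers))
print(resolve_columns(headers, combiner=lambda t, b: [f"{x}/{y}" for x, y in zip(t, b)]))
```

Keeping the fixed logic as the default and accepting a callable means existing callers would see no change, while documents with unusual header layouts can supply their own rule.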

@schadem schadem closed this as completed Jul 11, 2022
@schadem schadem reopened this Jul 11, 2022
@tb102122
Contributor

@sravzmum are you able to provide a sample document?
I agree with the option; at some stage we could even extend it to accept custom functions for processing.

@prasum

prasum commented Sep 7, 2022

I am also hitting the above issue when merging the top and bottom headers. In some scenarios, part of the column name in the top header is missing. I would appreciate your guidance on this.

@tb102122
Contributor

tb102122 commented Sep 7, 2022

@prasum Do you have a sample document you can share? Do you get the correct results from Textract in the OCR step?

@prasum

prasum commented Sep 9, 2022

Sorry for the late reply. I am not able to share the sample document due to internal restrictions. The scenario is the same as in the document shared above by @sravzmum. Yes, I am able to get the correct results from Textract in the OCR step.

@tb102122
Contributor

We would need some sort of example; otherwise we can't help.

@athewsey athewsey added the python Relates to the Python version of TRP label Aug 28, 2023
@mukul-llmate

mukul-llmate commented May 14, 2024

image

When I toggle "merge cells" in AWS Textract, I get a perfect table, but when I download the result, or call the API and parse the response, the cells are unmerged.

Screenshot from 2024-05-14 18-25-55
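
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the raw Textract response keeps the individual CELL blocks and describes merges separately in MERGED_CELL blocks, each pointing at the cells it spans through a CHILD relationship. A parser that reads only the CELL blocks would therefore see the table as unmerged. The sketch below re-applies the merges by copying the merged text into every spanned cell. The helper names and the sample blocks are hypothetical, and the code assumes the block list contains a single table.

```python
from typing import Dict, List


def cell_text(cell: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of the WORD blocks that are children of a CELL block."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def table_grid_with_merges(blocks: List[dict]) -> List[List[str]]:
    """Build a 2D grid of cell text, copying merged text to every spanned cell."""
    by_id = {b["Id"]: b for b in blocks}
    cells = [b for b in blocks if b["BlockType"] == "CELL"]
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:  # Textract row and column indices are 1-based
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, by_id)
    for merged in (b for b in blocks if b["BlockType"] == "MERGED_CELL"):
        children = [
            by_id[i]
            for rel in merged.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
        ]
        text = " ".join(t for t in (cell_text(ch, by_id) for ch in children) if t)
        for ch in children:
            grid[ch["RowIndex"] - 1][ch["ColumnIndex"] - 1] = text
    return grid


# Hand-written sample blocks, for illustration only (not real Textract output)
blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Amount"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Debit"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Credit"},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "m1", "BlockType": "MERGED_CELL", "RowIndex": 1, "ColumnIndex": 1,
     "RowSpan": 1, "ColumnSpan": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
]
print(table_grid_with_merges(blocks))
```

With real responses the blocks would come from the `Blocks` list of the Textract JSON, filtered to the cells of one TABLE block before calling the function.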
