Parts of the merged cells text is getting cut off when merged #129

Open
GradyMellin opened this issue Mar 17, 2023 · 1 comment
Labels
python Relates to the Python version of TRP

Comments

@GradyMellin

When I merge cells whose text spans multiple cells, across both rows and columns, only the text from the first cell is transferred. I assume I need to do something similar to my combine_headers function, but I am having trouble working out how to access the other cells. I have attached a picture of a table similar to the one giving me problems, along with my code and results. Any help would be greatly appreciated!

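import pandas as pd
from textractcaller.t_call import call_textract, Textract_Features
from trp import Document
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo

# documentName (path or S3 URI of the input document) is defined elsewhere.
textract_json = call_textract(input_document=documentName, features=[Textract_Features.TABLES])

# Re-order the blocks geometrically, then load them into the TRP Document model.
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))

dataframes = []

def combine_headers(top_h, mid_h, bottom_h):
    # Prefix columns 4 and 5 of the bottom header row with the header text above them.
    try:
        bottom_h[4] = top_h[4] + " " + mid_h[4] + " " + bottom_h[4]
        bottom_h[5] = top_h[4] + " " + mid_h[4] + " " + bottom_h[5]
    except IndexError:
        # Tables with fewer than six columns keep their bottom header unchanged.
        pass

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()
        # Three header rows are expected: top, middle and bottom.
        if len(headers) >= 3:
            print("Statement headers: " + repr(headers))
            top_header = headers[0]
            middle_header = headers[1]
            bottom_header = headers[2]
            combine_headers(top_header, middle_header, bottom_header)
            for row in table.rows_without_header:
                table_data.append([cell.mergedText for cell in row.cells])

            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)
                dataframes.append(df)
                # Print each table as it is built, so df is always defined here.
                print(df.to_markdown())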

Table:
Screenshot (196)

As you can see below, the header text is cut off after "Local (Up" because it runs into the next cell. The same happens in all of the Length Class rows, which lose the "pages)" part of the text, and in the Extra-Long books row.
Results:

Length Class Category Class Codes Codes Distribution Local (Up To Mark Up Factor Distribution Local (Up To Cost Factor
0 Short Books (0 100 Children's Non-fiction Fiction 011-- 012-- 1.10 1.00
1 Short Books (0 100 Mystery Non-fiction Fiction 021-- 022-- 1.55 1.15
2 Short Books (0 100 Romance Non-fiction Fiction 031-- 032-- 1.40 1.00
3
4 Medium Books (101 500 Children's Non-fiction Fiction 211-- 212-- 1.05 0.95
5 Medium Books (101 500 Mystery Non-fiction Fiction 221-- 222-- 1.50 0.70
6 Medium Books (101 500 Romance Non-fiction Fiction 231-- 232-- 1.40 0.75
7
8 Long Books (501 - 1,000 Children's Non-fiction Fiction 311-- 312-- 1.10 0.65
9 Long Books (501 - 1,000 Mystery Non-fiction Fiction 321-- 322-- 1.55 0.90
10 Long Books (501 - 1,000 Romance Non-fiction Fiction 331-- 332-- 1.25 0.70
11
12 Extra-Long (Over 1,000 Extra-Long (Over 1,000 Non-fiction Fiction 401-- 402-- 2.45 1.15
13
@athewsey added the python label (Relates to the Python version of TRP) on Aug 28, 2023
@pranavbhat12

Hey @GradyMellin, I am also facing the same issue. Did you find a workaround for this?
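
One possible direction, sketched here against the raw Textract JSON rather than the TRP classes, is to read the MERGED_CELL blocks directly. In the Textract response, a MERGED_CELL block lists its underlying CELL blocks in a CHILD relationship, and each CELL lists its WORD blocks the same way. The helper names below (child_ids, cell_text, merged_cell_texts) are made up for illustration and are not part of TRP, and the sample dictionary is hand-written to mimic the "Short Books (0 100" / "pages)" split reported above, not real Textract output. It also assumes the response is a single dictionary with a "Blocks" list.

```python
def child_ids(block, rel_type="CHILD"):
    """Return the IDs listed under one relationship type of a Textract block."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def cell_text(cell, blocks_by_id):
    """Join the text of the WORD children of a single CELL block."""
    words = (blocks_by_id[i] for i in child_ids(cell))
    return " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")


def merged_cell_texts(textract_json):
    """Map (RowIndex, ColumnIndex) of each MERGED_CELL to the text of all its cells."""
    blocks_by_id = {b["Id"]: b for b in textract_json["Blocks"]}
    result = {}
    for block in textract_json["Blocks"]:
        if block["BlockType"] != "MERGED_CELL":
            continue
        cells = [blocks_by_id[i] for i in child_ids(block)]
        # Read the child cells top to bottom, then left to right.
        cells.sort(key=lambda c: (c["RowIndex"], c["ColumnIndex"]))
        parts = [cell_text(c, blocks_by_id) for c in cells]
        key = (block["RowIndex"], block["ColumnIndex"])
        result[key] = " ".join(p for p in parts if p)
    return result


# Hand-written sample (an assumption, not real Textract output): one merged
# cell spanning two columns, with its text split across both child cells.
sample = {"Blocks": [
    {"Id": "m1", "BlockType": "MERGED_CELL", "RowIndex": 1, "ColumnIndex": 1,
     "RowSpan": 1, "ColumnSpan": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3", "w4"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Short"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Books"},
    {"Id": "w3", "BlockType": "WORD", "Text": "(0"},
    {"Id": "w4", "BlockType": "WORD", "Text": "100"},
    {"Id": "w5", "BlockType": "WORD", "Text": "pages)"},
]}

print(merged_cell_texts(sample))
```

If this matches how your response is structured, the resulting dictionary could be used to overwrite the text of the first cell of each merged region before building the DataFrame. I have not verified this against the table in the screenshot, so treat it as a starting point rather than a confirmed fix.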
