Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Comment in ODS-file gets included in string cells #55200

Closed
2 of 3 tasks
pyZerrenner opened this issue Sep 19, 2023 · 6 comments · Fixed by #55727
Closed
2 of 3 tasks

BUG: Comment in ODS-file gets included in string cells #55200

pyZerrenner opened this issue Sep 19, 2023 · 6 comments · Fixed by #55727
Assignees
Labels
Bug IO Excel read_excel, to_excel

Comments

@pyZerrenner
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.read_excel('TableWithComment.ods', engine='odf', header=0)
print(df)

Issue Description

Here is the example *.ods file for the example code: TableWithComment.ods. The file was create using LibreOffice Calc 7.4.7.2
TableWithComment

The print(df) command produces the following output

   Quantity 1 2023-09-19T00:00:00Comment IQuantity 2
0          10                                     20
1          11                                     21
2          12                                     22
3          13                                     23
4          14                                     24
5          15      2023-09-19T00:00:00Comment IIIABC
6          16                                     26
7          17                                     27
8          18                                     28

If a cell in the *.ods file has a comment and the cell content is a string (B1 and B7), the comment text and timestamp are appended in front of the cell content. This also applies to the header line. For cells containing numbers, the comment is ignored (A6 and B10).

(Note, that I am referring to comments inserted using LibreOffice Calc itself. This is unrelated to the comment argument of read_excel.)

Expected Behavior

The ods-comments should be ignored and only the cell content read into the dataframe. The expected output from print(df) is

   Quantity 1 Quantity 2
0          10         20
1          11         21
2          12         22
3          13         23
4          14         24
5          15        ABC
6          16         26
7          17         27
8          18         28

(This is the output when all comments in the *.ods file are deleted)

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : de_DE.cp1252

pandas : 2.1.0
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.4.0
pyqt5 : None

(The installed odfpy version is 1.4.1, which somehow is not shown in the above output)

Libre Office Version

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 17763; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
Calc: threaded

@pyZerrenner pyZerrenner added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 19, 2023
@paulreece
Copy link
Contributor

paulreece commented Sep 19, 2023

I can confirm that this bug exists on the main branch 2.1.0.

@rhshadrach rhshadrach added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 19, 2023
@rhshadrach
Copy link
Member

Thanks for the report! Further investigations and PRs to fix are welcome!

@SuadHus
Copy link

SuadHus commented Sep 26, 2023

take

@Leohemmingsson
Copy link

take

@dimastbk
Copy link
Contributor

dimastbk commented Oct 3, 2023

Maybe just skip office:annotation?

diff --git a/pandas/io/excel/_odfreader.py b/pandas/io/excel/_odfreader.py
index 277f64f636..48677468c7 100644
--- a/pandas/io/excel/_odfreader.py
+++ b/pandas/io/excel/_odfreader.py
@@ -206,7 +206,11 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
         elif cell_type == "string":
            return self._get_cell_string_value(cell)
         elif cell_type == "currency":
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
@@ -228,8 +232,10 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
         """
         from odf.element import Element
         from odf.namespaces import TEXTNS
+        from odf.office import Annotation
         from odf.text import S
 
+        office_annotation = Annotation().qname
         text_s = S().qname
 
         value = []
@@ -239,6 +245,8 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
                 if fragment.qname == text_s:
                     spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))
                     value.append(" " * spaces)
+                elif fragment.qname == office_annotation:
+                    continue
                 else:
                     # recursive impl needed in case of nested fragments
                     # with multiple spaces

Or extract only text:p/text:s (full list of possible elements here).

@Leohemmingsson
Copy link

@dimastbk I tried these changes and it looks good, you should open a MR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants