BUG: Comment in ODS-file gets included in string cells #55200

pyZerrenner · 2023-09-19T14:17:01Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.read_excel('TableWithComment.ods', engine='odf', header=0)
print(df)

Issue Description

Here is the *.ods file used by the example code: TableWithComment.ods. The file was created using LibreOffice Calc 7.4.7.2.

The print(df) command produces the following output:

   Quantity 1 2023-09-19T00:00:00Comment IQuantity 2
0          10                                     20
1          11                                     21
2          12                                     22
3          13                                     23
4          14                                     24
5          15      2023-09-19T00:00:00Comment IIIABC
6          16                                     26
7          17                                     27
8          18                                     28

If a cell in the *.ods file has a comment and the cell content is a string (B1 and B7), the comment's timestamp and text are prepended to the cell content. This also applies to the header line. For cells containing numbers, the comment is ignored (A6 and B10).

(Note that I am referring to comments inserted using LibreOffice Calc itself. This is unrelated to the comment argument of read_excel.)
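
To illustrate where the extra text may come from, here is a minimal sketch using only the standard library. It assumes that LibreOffice stores a cell comment as an office:annotation element (with a dc:date and a text:p child) inside the table:table-cell, ahead of the cell's own text:p. The XML fragment is hand-written for illustration and is not taken from TableWithComment.ods; concatenating all text below the cell then reproduces the header seen above.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespaces (dc is the Dublin Core namespace used for the date).
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DC = "http://purl.org/dc/elements/1.1/"

# Hand-written stand-in for cell B1; written without whitespace between
# elements so that only the real text nodes contribute to the result.
CELL_XML = (
    f'<table:table-cell xmlns:office="{OFFICE}" xmlns:table="{TABLE}" '
    f'xmlns:text="{TEXT}" xmlns:dc="{DC}" office:value-type="string">'
    "<office:annotation>"
    "<dc:date>2023-09-19T00:00:00</dc:date>"
    "<text:p>Comment I</text:p>"
    "</office:annotation>"
    "<text:p>Quantity 2</text:p>"
    "</table:table-cell>"
)

cell = ET.fromstring(CELL_XML)

# Joining every text node below the cell also picks up the annotation's
# date and comment text, which precede the actual cell content.
leaked = "".join(cell.itertext())
print(leaked)  # 2023-09-19T00:00:00Comment IQuantity 2
```

Numeric cells would not show this, presumably because their value is read from the office:value attribute rather than from the child elements.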

Expected Behavior

The ODS comments should be ignored and only the cell content read into the dataframe. The expected output from print(df) is:

   Quantity 1 Quantity 2
0          10         20
1          11         21
2          12         22
3          13         23
4          14         24
5          15        ABC
6          16         26
7          17         27
8          18         28

(This is the output when all comments in the *.ods file are deleted.)

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : de_DE.cp1252

pandas : 2.1.0
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.4.0
pyqt5 : None

(The installed odfpy version is 1.4.1, which somehow is not shown in the above output)

Libre Office Version

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 17763; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
Calc: threaded

paulreece · 2023-09-19T19:42:02Z

I can confirm that this bug exists on the main branch 2.1.0.

rhshadrach · 2023-09-19T20:46:04Z

Thanks for the report! Further investigations and PRs to fix are welcome!

SuadHus · 2023-09-26T09:08:02Z

take

Leohemmingsson · 2023-09-26T09:09:21Z

take

dimastbk · 2023-10-03T08:08:39Z

Maybe just skip office:annotation?

diff --git a/pandas/io/excel/_odfreader.py b/pandas/io/excel/_odfreader.py
index 277f64f636..48677468c7 100644
--- a/pandas/io/excel/_odfreader.py
+++ b/pandas/io/excel/_odfreader.py
@@ -206,7 +206,11 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
         elif cell_type == "string":
             return self._get_cell_string_value(cell)
         elif cell_type == "currency":
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
@@ -228,8 +232,10 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
         """
         from odf.element import Element
         from odf.namespaces import TEXTNS
+        from odf.office import Annotation
         from odf.text import S
 
+        office_annotation = Annotation().qname
         text_s = S().qname
 
         value = []
@@ -239,6 +245,8 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
                 if fragment.qname == text_s:
                     spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))
                     value.append(" " * spaces)
+                elif fragment.qname == office_annotation:
+                    continue
                 else:
                     # recursive impl needed in case of nested fragments
                     # with multiple spaces

Or extract only text:p/text:s (full list of possible elements here).
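
Both options can be sketched outside pandas with the standard library. This is only an illustration of the idea, not the code in _odfreader.py: it uses xml.etree.ElementTree instead of odfpy, the helper names fragment_text and paragraphs_only_text are made up for this example, and the cell XML is hand-written.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DC = "http://purl.org/dc/elements/1.1/"

ANNOTATION = f"{{{OFFICE}}}annotation"
PARAGRAPH = f"{{{TEXT}}}p"
SPACE = f"{{{TEXT}}}s"
SPACE_COUNT = f"{{{TEXT}}}c"


def fragment_text(element):
    """Option 1 (hypothetical helper): walk the subtree, skip annotations."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == ANNOTATION:
            pass  # ignore the whole comment subtree
        elif child.tag == SPACE:
            # text:s stands for a run of spaces; text:c holds the count
            parts.append(" " * int(child.get(SPACE_COUNT, "1")))
        else:
            parts.append(fragment_text(child))  # nested fragments
        parts.append(child.tail or "")
    return "".join(parts)


def paragraphs_only_text(cell):
    """Option 2 (hypothetical helper): read only the cell's direct text:p children."""
    return "".join(fragment_text(p) for p in cell.findall(PARAGRAPH))


def make_cell(inner_xml):
    # Wrap hand-written content in a namespaced table:table-cell element.
    return ET.fromstring(
        f'<table:table-cell xmlns:office="{OFFICE}" xmlns:table="{TABLE}" '
        f'xmlns:text="{TEXT}" xmlns:dc="{DC}">{inner_xml}</table:table-cell>'
    )


commented = make_cell(
    "<office:annotation><dc:date>2023-09-19T00:00:00</dc:date>"
    "<text:p>Comment III</text:p></office:annotation><text:p>ABC</text:p>"
)
spaced = make_cell('<text:p>A<text:s text:c="2"/>B</text:p>')

print(fragment_text(commented))         # ABC
print(paragraphs_only_text(commented))  # ABC
print(repr(fragment_text(spaced)))      # 'A  B'
```

Skipping office:annotation is the smaller change; reading only text:p would also ignore any other non-text child of a cell, which may or may not be desirable.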

Leohemmingsson · 2023-10-04T09:06:28Z

@dimastbk I tried these changes and they look good; you should open an MR :)

pyZerrenner added the Bug and Needs Triage labels Sep 19, 2023

rhshadrach added the IO Excel label and removed the Needs Triage label Sep 19, 2023

github-actions bot assigned SuadHus Sep 26, 2023

github-actions bot assigned Leohemmingsson Sep 26, 2023

mzuzek mentioned this issue Sep 29, 2023

BUG: fix parsing of ODF time values with comments #55324

Closed

5 tasks

Leohemmingsson removed their assignment Oct 15, 2023

dimastbk mentioned this issue Oct 27, 2023

BUG: Comment in ODS-file gets included in string cells #55727

Merged

5 tasks

mroeschke closed this as completed in #55727 Oct 30, 2023
