Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: ExcelWriter set_column with num_format datetime doesn't work (but it works for xlsxwriter) #55196

Open
3 tasks done
lrisch opened this issue Sep 19, 2023 · 12 comments
Open
3 tasks done
Labels
Bug Closing Candidate May be closeable, needs more eyeballs IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action

Comments

@lrisch
Copy link

lrisch commented Sep 19, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from datetime import datetime
import pandas as pd

writer = pd.ExcelWriter('datetime_bug.xlsx', engine='xlsxwriter')
workbook = writer.book

# datetime from pandas is not converted with the specified num_format
data = pd.DataFrame([[0.05, datetime(year=2020, month=1, day=1)]], columns=["percentage", "datetime_date"])

data.to_excel(writer, sheet_name="test_sheet", index=False)

percentage_format = workbook.add_format({'num_format':'0.00%'})
date_format = workbook.add_format({'num_format':'dd.mm.yyyy'})

writer.sheets["test_sheet"].set_column('A:A', width=None, cell_format=percentage_format)
writer.sheets["test_sheet"].set_column('B:B', width=None, cell_format=date_format)

workbook.close()

Issue Description

I created an Excel sheet using pd.to_excel(). The data contains one column with a float number and one column with a datetime.datetime object.

Then, I want to set the column type using the num_format parameter in set_column.

This works for the float value, which I want in the percentage format.
However, it does not work for the datetime value. The value is in the wrong format "01.01.2020 00:00:00" (German date). The value in the Excel file is in the same wrong format, whether I run set_column() for the date column or not.

I have tried to circle in the problem by trying other variations, which turns out that they do work. However, those would be only workarounds for our use case.

  1. Using the parameter datetime_format='dd.mm.yyyy' in the pd.ExcelWriter constructor puts the date into the right format.

  2. Instead of writing the sheet from a DataFrame, using the xlsxwriter functions works fine:

import xlsxwriter
from datetime import datetime

workbook = xlsxwriter.Workbook('datetime_bug.xlsx')
worksheet = workbook.add_worksheet()

# using the xlsxwriter functions works as intended
worksheet.write('A1', 0.05)
worksheet.write('B1', datetime(year=2020, month=1, day=1))

percentage_format = workbook.add_format({'num_format':'0.00%'})
date_format = workbook.add_format({'num_format':'dd.mm.yyyy'})

worksheet.set_column('A:A', width=None, cell_format=percentage_format)
worksheet.set_column('B:B', width=None, cell_format=date_format)

workbook.close()

Expected Behavior

The value is in the format "01.01.2020" when using set_column with pd.ExcelWriter.

Installed Versions

INSTALLED VERSIONS ------------------ commit : ba1cccd python : 3.9.11.final.0 python-bits : 64 OS : Linux OS-release : 4.18.0-477.21.1.el8_8.x86_64 Version : #1 SMP Thu Jul 20 08:38:27 EDT 2023 machine : x86_64 processor : byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.26.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.7.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 3.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@lrisch lrisch added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 19, 2023
@rhshadrach
Copy link
Member

rhshadrach commented Sep 19, 2023

set_column does not change the format; it changes the default format:

The cell_format parameter will be applied to any cells in the column that don’t have a format.

Since pandas is setting the format for date cells, it has no effect. As far as I can tell there is no easy way to overwrite an existing format with xlsxwriter. The best resolution I see for this within pandas is to use datetime_format as you mentioned.

@rhshadrach rhshadrach added IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 19, 2023
@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Oct 13, 2023
@Pavanmahaveer7
Copy link

So here if we are looking to change any format of date column, by using strftime function we could make the possible changes to the date values.

@rhshadrach
Copy link
Member

@Pavanmahaveer7 - I think you are saying to modify the format prior calling pd.write_excel. Yes, that is an option, but then I believe the Excel file will not have the Date datatype.

@lrisch
Copy link
Author

lrisch commented Oct 27, 2023

@rhshadrach Okay thanks for that information, I didn't know it only changed the default format. In that case we will keep using datetime_format 👍

@rhshadrach
Copy link
Member

rhshadrach commented Oct 30, 2023

Currently the default value of datetime_formatstr is None, which gives you the format "YYYY-MM-DD HH:MM:SS". We could instead make the default value be ""YYYY-MM-DD HH:MM:SS" and then if None is passed, apply no formatting. It would allow you to then apply your own formatting with xlsxwriter. However, without applying any formatting I think the values would then appear as integers in the Excel sheet - and this would perhaps cause confusion.

I still think a better resolution is to have a way to override the existing format via xlsxwriter.

@lrisch
Copy link
Author

lrisch commented Nov 17, 2023

@rhshadrach Im sorry, I dont really understand your last message.
As I understand it, I cannot use set_column because pandas is overwriting whatever I set as a format there. Therefore I have to use datetime_format (which is fine, it would have been nicer to be able to use set_column since we use it for the other datatypes to, but it is ok as a workaround).
But I dont understand, why do you want to change the default of datetime_format?
And what do you mean by this: "I still think a better resolution is to have a way to override the existing format via xlsxwriter."

@rhshadrach
Copy link
Member

As I understand it, I cannot use set_column because pandas is overwriting whatever I set as a format there.

Depending on interpretation, I think this is not correct - in particular pandas is not overwriting. pandas sets the format as "YYYY-MM-DD HH:MM:SS" first and then you are calling set_column. The issue is that calling set_column does nothing because it only modifies the default format. Modifying the default format has no impact because pandas has set a (non-default) format.

If users could tell pandas to not set any format for datetime (my proposal above - which I do not like but it is an option), then changing the default format would have an impact and so your call to set_column would have the desired effect.

It seems better to me if xlsxwriter gave you the ability to modify the format on an entire column, and not just the default format. I believe xlsxwriter has no such function.

@rhshadrach
Copy link
Member

Perhaps my use of "default format" is confusing and I should be saying "no format" instead.

@rhshadrach
Copy link
Member

Looking into this again, I'm now understanding that Excel has cell, row, and column formats. The set_column sets the column format, but priority is given to any cell in the column that has a cell format set.

Also, ExcelWriter does have date_format and datetime_format arguments that can be set!

@rhshadrach
Copy link
Member

@jmcnamara - I was wondering if you could confirm my understanding here is correct. I believe this comment is self-contained, so hopefully you don't have to wade through the above comments 😄.

pandas writes dates and datetimes by applying a cell format to particular cells. This makes xlxswriter's set_column (and set_row) ineffective on such cells - perhaps surprisingly until you understand how Excel has different formatters for rows, columns, and cells.

I was wondering if pandas should do something different here - but it appears there is no "default date format" in Excel. XlsxWriter does have a default_date_format, but it appears this is implemented by setting the cell format, similar to pandas:

https://github.com/jmcnamara/XlsxWriter/blob/cedf448ad2139ae3fb88dd618b0d78d67a39f2f5/xlsxwriter/worksheet.py#L1102-L1107

So users wanting to change the format of date and datetime columns that already have a format applied just need to loop over the cells in a column.

@jmcnamara
Copy link
Contributor

I was wondering if you could confirm my understanding here is correct.

Yes that is correct. To reiterate: Pandas applies a cell format when writing a datetime and that cell format can't/won't be overwritten by a row or column format.

@rhshadrach I think the proposal to have change the behaviour of datetime_format=None and date_format=None in pd.ExcelWriter() so that they don't apply a default format is a good one. That would allow the user to apply a format via set_column(). In most cases applying the number format via datetime_format and date_format is sufficient for most users but sometimes they need to also apply font or border formatting as well and then the ability to turn off the Pandas formatting and use set_column() instead would be useful.

For reference here are the XlsxWriter docs where I try to point people in the right direction on Formatting of the Dataframe output.

@rhshadrach
Copy link
Member

Thanks @jmcnamara!

@rhshadrach I think the proposal to have change the behaviour of datetime_format=None and date_format=None in pd.ExcelWriter() so that they don't apply a default format is a good one.

Makes sense - but I would think just not as the default. I'll put up a PR for this and improve some of the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants