BUG: read_parquet from public parquet file with AWS credentials in environment gives OSError #53701

Open
mvashishtha opened this issue Jun 16, 2023 · 6 comments
Labels: Bug, IO Parquet

@mvashishtha
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import os
import pandas as pd

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["AWS_CONFIG_FILE"] = "invalid"
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "invalid"

os.environ["AWS_ACCESS_KEY_ID"] = "invalid"
os.environ["AWS_SECRET_ACCESS_KEY"] = "invalid"
# displays without error
display(pd.read_csv("s3://modin-datasets/testing/multiple_csv/test_data0.csv"))
# OSError from AWS error ACCESS_DENIED
display(pd.read_parquet("s3://modin-datasets/testing/test_data.parquet"))

del os.environ["AWS_ACCESS_KEY_ID"]
del os.environ["AWS_SECRET_ACCESS_KEY"]
# displays without error
display(pd.read_csv("s3://modin-datasets/testing/multiple_csv/test_data0.csv"))
# displays without error
display(pd.read_parquet("s3://modin-datasets/testing/test_data.parquet"))

Issue Description

I'm trying to use pandas to read a public CSV file and a public parquet file from an s3 bucket. When I have the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set to anything (my valid credentials or some invalid credentials), reading the CSV works, but reading the parquet file raises an OSError caused by an AWS ACCESS_DENIED error.

When I remove AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, I can read both the CSV file and the parquet file.

Note that in the example above I have set the correct AWS_DEFAULT_REGION, and I have also pointed AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE at a nonexistent location to show that any credentials in ~/.aws don't affect the result.

Expected Behavior

I should be able to read both the CSV file and the parquet file from s3, whether or not I have any kind of credentials in my environment, because both files are open to the public. I think pandas is supposed to fall back to anonymous access via the retry mechanism here.
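For context, the fallback that makes read_csv succeed looks roughly like the following. This is a minimal sketch, not pandas's actual code, assuming fsspec and s3fs are installed; the URL is the one from the repro above.

import fsspec
import pandas as pd

path = "s3://modin-datasets/testing/multiple_csv/test_data0.csv"

# First try the default AWS credential chain; if the request is
# rejected (invalid env credentials land here as a PermissionError,
# a subclass of OSError), retry with an anonymous (unsigned) client.
try:
    with fsspec.open(path, mode="rb") as f:
        df = pd.read_csv(f)
except OSError:
    with fsspec.open(path, mode="rb", anon=True) as f:
        df = pd.read_csv(f)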

Installed Versions

In [3]: pd.show_versions()
/Users/maheshvashishtha/anaconda3/envs/pandas-dev-py39/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : 0bc16da
python : 3.9.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+977.g0bc16da1e5
numpy : 2.0.0.dev0+84.g828fba29e
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : 7.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.6.0
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

mvashishtha added the Bug and Needs Triage labels on Jun 16, 2023
lithomas1 added the IO Parquet label and removed the Needs Triage label on Jun 18, 2023
@lithomas1 (Member)

Hi, thanks for opening this issue.

I can confirm that read_parquet is the problem here (with both the fastparquet and pyarrow engines). Internally, read_parquet doesn't go through the retry mechanism you mentioned because it relies on the engine to open the file itself, whereas read_csv does use that path.

I think it'd make sense for parquet to go through the same path if possible.

If you're interested, PRs would be welcome. Otherwise, I'll try to get to this in a couple of weeks.
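In the meantime, a workaround for public buckets is to request anonymous access explicitly through storage_options, which pandas forwards to the underlying filesystem. A sketch, assuming s3fs is installed:

import pandas as pd

# For a public object, skip the credential chain entirely by asking
# for an unsigned (anonymous) S3 client; storage_options is passed
# through to the filesystem layer.
df = pd.read_parquet(
    "s3://modin-datasets/testing/test_data.parquet",
    storage_options={"anon": True},
)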

@SanjithChockan (Contributor)

take

@SanjithChockan (Contributor)

Hello, I am a new contributor and just submitted my first PR for this issue (#53895).

I added the retry mechanism in the file handling method for PyArrow when read_parquet is invoked, and the above reproducible example no longer gives an OSError.

Any feedback would be appreciated!
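For anyone who wants to verify anonymous access at the engine level, independently of pandas, pyarrow's S3 filesystem can be exercised directly. A sketch, assuming pyarrow is installed; the region matches the repro above:

import pyarrow.fs as pafs
import pyarrow.parquet as pq

# An anonymous (unsigned) S3 client reads no credentials from the
# environment, so invalid AWS_* variables cannot trigger ACCESS_DENIED.
fs = pafs.S3FileSystem(anonymous=True, region="us-west-2")
table = pq.read_table("modin-datasets/testing/test_data.parquet", filesystem=fs)
df = table.to_pandas()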

@Dinesh-N

Dinesh-N commented Nov 1, 2023

Hi, is this bug resolved? I am using pandas 2.1.0 and I still have this issue.

@jpaye

jpaye commented Feb 12, 2024

Hi, is this bug resolved? In our local testing it seems we still have to pin pandas 2.0.3 to avoid this error.

@jimxliu

jimxliu commented Oct 31, 2024

I see a similar error in to_parquet when the path is s3://. It seems unable to detect ~/.aws/config and instead uses the server's AWS role (EC2 in my case), which doesn't have write access. Repro:

>>> import pandas as pd
>>> df = pd.DataFrame(data={"a": [1,2,3], "b": [4,5,6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.to_parquet("s3://my-test-bucket/tmp/test_pandas_upgrade.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/home/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3113, in to_parquet
    return to_parquet(
  File "/home/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 480, in to_parquet
    impl.write(
  File "/home/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 228, in write
    self.api.parquet.write_table(
  File "/home/.local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1883, in write_table
    with ParquetWriter(
  File "/home/.local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1004, in __init__
    sink = self.file_handle = filesystem.open_output_stream(
  File "pyarrow/_fs.pyx", line 878, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When initiating multiple part upload for key 'tmp/test_pandas_upgrade.parquet' in bucket 'my-test-bucket': AWS Error ACCESS_DENIED during CreateMultipartUpload operation: User: <arn:aws:sts::assumed_role> is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::test-bucket-staging/tmp/test_pandas_upgrade.parquet" because no identity-based policy allows the s3:PutObject action
$ pip3 list | grep -E "pandas|pyarrow"
pandas                   2.2.3
pyarrow                  15.0.2

> Hi, is this bug resolved? In our local testing it seems we still have to pin 2.0.3 to avoid this error

Pinning works around the issue, but it'd be great to fix it properly. It broke in releases after 2.0.3.
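If the wrong credential source is being picked up, one workaround is to configure the filesystem explicitly via storage_options, e.g. selecting a named profile from ~/.aws. A sketch; the profile name "my-profile" and the bucket are placeholders:

import pandas as pd

df = pd.DataFrame(data={"a": [1, 2, 3], "b": [4, 5, 6]})

# Tell s3fs to use a named profile from ~/.aws/credentials or
# ~/.aws/config instead of whatever the default chain resolves to
# (e.g. the EC2 instance role). "my-profile" is a placeholder.
df.to_parquet(
    "s3://my-test-bucket/tmp/test_pandas_upgrade.parquet",
    storage_options={"profile": "my-profile"},
)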
