BUG: read_parquet from public parquet file with AWS credentials in environment gives OSError #53701

Open
mvashishtha opened this issue Jun 16, 2023 · 6 comments
Labels: Bug, IO Parquet

@mvashishtha
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import os
import pandas as pd

os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["AWS_CONFIG_FILE"] = "invalid"
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "invalid"

os.environ["AWS_ACCESS_KEY_ID"] = "invalid"
os.environ["AWS_SECRET_ACCESS_KEY"] = "invalid"
# displays without error
display(pd.read_csv("s3://modin-datasets/testing/multiple_csv/test_data0.csv"))
# OSError from AWS error ACCESS_DENIED
display(pd.read_parquet("s3://modin-datasets/testing/test_data.parquet"))

del os.environ["AWS_ACCESS_KEY_ID"]
del os.environ["AWS_SECRET_ACCESS_KEY"]
# displays without error
display(pd.read_csv("s3://modin-datasets/testing/multiple_csv/test_data0.csv"))
# displays without error
display(pd.read_parquet("s3://modin-datasets/testing/test_data.parquet"))

Issue Description

I'm trying to use pandas to read a public CSV file and a public parquet file from an s3 bucket. When I have the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set to anything (my valid credentials or some invalid credentials), reading the CSV works, but reading the parquet file raises an OSError caused by an AWS ACCESS_DENIED error.

When I remove AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, I can read both the CSV file and the parquet file.

Note that in the example above I have set the correct AWS_DEFAULT_REGION, and I have also pointed AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE at a nonexistent location to show that any credentials in ~/.aws don't affect the result.

Expected Behavior

I should be able to read both the CSV file and the parquet file from s3, whether or not I have any kind of credentials in my environment, because both files are open to the public. I think pandas is supposed to fall back to anonymous access via the retry mechanism here.
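For context, the fallback that makes read_csv succeed looks roughly like the following. This is a minimal sketch, not pandas's actual code, assuming fsspec and s3fs are installed; the URL is the one from the repro above.

import fsspec
import pandas as pd

path = "s3://modin-datasets/testing/multiple_csv/test_data0.csv"

# First try the default AWS credential chain; if the request is
# rejected (invalid env credentials land here as a PermissionError,
# a subclass of OSError), retry with an anonymous (unsigned) client.
try:
    with fsspec.open(path, mode="rb") as f:
        df = pd.read_csv(f)
except OSError:
    with fsspec.open(path, mode="rb", anon=True) as f:
        df = pd.read_csv(f)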

Installed Versions

In [3]: pd.show_versions()
/Users/maheshvashishtha/anaconda3/envs/pandas-dev-py39/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : 0bc16da
python : 3.9.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+977.g0bc16da1e5
numpy : 2.0.0.dev0+84.g828fba29e
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : 7.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.6.0
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

mvashishtha added the Bug and Needs Triage labels on Jun 16, 2023
lithomas1 added the IO Parquet label and removed the Needs Triage label on Jun 18, 2023
@lithomas1 (Member)

Hi, thanks for opening this issue.

I can confirm that read_parquet is the problem here (with both the fastparquet and pyarrow engines). Internally, read_parquet doesn't go through the retry mechanism you mentioned because it relies on the engine to open the file itself, whereas read_csv does use that path.

I think it'd make sense for parquet to go through the same path if possible.

If you're interested, PRs would be welcome. Otherwise, I'll try to get to this in a couple of weeks.
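In the meantime, a workaround for public buckets is to request anonymous access explicitly through storage_options, which pandas forwards to the underlying filesystem. A sketch, assuming s3fs is installed:

import pandas as pd

# For a public object, skip the credential chain entirely by asking
# for an unsigned (anonymous) S3 client; storage_options is passed
# through to the filesystem layer.
df = pd.read_parquet(
    "s3://modin-datasets/testing/test_data.parquet",
    storage_options={"anon": True},
)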

@SanjithChockan (Contributor)

take

@SanjithChockan (Contributor)

Hello, I am a new contributor and just submitted my first PR for this issue (#53895).

I added the retry mechanism in the file handling method for PyArrow when read_parquet is invoked, and the above reproducible example no longer gives an OSError.

Any feedback would be appreciated!
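For anyone who wants to verify anonymous access at the engine level, independently of pandas, pyarrow's S3 filesystem can be exercised directly. A sketch, assuming pyarrow is installed; the region matches the repro above:

import pyarrow.fs as pafs
import pyarrow.parquet as pq

# An anonymous (unsigned) S3 client reads no credentials from the
# environment, so invalid AWS_* variables cannot trigger ACCESS_DENIED.
fs = pafs.S3FileSystem(anonymous=True, region="us-west-2")
table = pq.read_table("modin-datasets/testing/test_data.parquet", filesystem=fs)
df = table.to_pandas()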

@Dinesh-N

Dinesh-N commented Nov 1, 2023

Hi, is this bug resolved? I am using pandas 2.1.0 and I still have this issue.

@jpaye

jpaye commented Feb 12, 2024

Hi, is this bug resolved? In our local testing it seems we still have to pin pandas 2.0.3 to avoid this error.

@jimxliu

jimxliu commented Oct 31, 2024

I see a similar error in to_parquet when the path is s3://. It seems unable to detect ~/.aws/config and instead uses the server's AWS role (EC2 in my case), which doesn't have write access. Repro:

>>> import pandas as pd
>>> df = pd.DataFrame(data={"a": [1,2,3], "b": [4,5,6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.to_parquet("s3://my-test-bucket/tmp/test_pandas_upgrade.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/home/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3113, in to_parquet
    return to_parquet(
  File "/home/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 480, in to_parquet
    impl.write(
  File "/home/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 228, in write
    self.api.parquet.write_table(
  File "/home/.local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1883, in write_table
    with ParquetWriter(
  File "/home/.local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1004, in __init__
    sink = self.file_handle = filesystem.open_output_stream(
  File "pyarrow/_fs.pyx", line 878, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When initiating multiple part upload for key 'tmp/test_pandas_upgrade.parquet' in bucket 'my-test-bucket': AWS Error ACCESS_DENIED during CreateMultipartUpload operation: User: <arn:aws:sts::assumed_role> is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::test-bucket-staging/tmp/test_pandas_upgrade.parquet" because no identity-based policy allows the s3:PutObject action
$ pip3 list | grep -E "pandas|pyarrow"
pandas                   2.2.3
pyarrow                  15.0.2

> Hi, is this bug resolved? In our local testing it seems we still have to pin 2.0.3 to avoid this error

Pinning works around the issue, but it'd be great to fix it properly. It broke in releases after 2.0.3.
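If the wrong credential source is being picked up, one workaround is to configure the filesystem explicitly via storage_options, e.g. selecting a named profile from ~/.aws. A sketch; the profile name "my-profile" and the bucket are placeholders:

import pandas as pd

df = pd.DataFrame(data={"a": [1, 2, 3], "b": [4, 5, 6]})

# Tell s3fs to use a named profile from ~/.aws/credentials or
# ~/.aws/config instead of whatever the default chain resolves to
# (e.g. the EC2 instance role). "my-profile" is a placeholder.
df.to_parquet(
    "s3://my-test-bucket/tmp/test_pandas_upgrade.parquet",
    storage_options={"profile": "my-profile"},
)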
