BUG: read_parquet from public parquet file with AWS credentials in environment gives OSError #53701
Comments
Hi, thanks for opening this issue. I can confirm that read_parquet is the issue here (both fastparquet + pyarrow engines). Internally, the read_parquet code doesn't go through the retry mechanism you mentioned, since it relies on the engine to open the file correctly, but read_csv does. I think it'd make sense for parquet to go through the same path if possible. If you're interested, PRs would be welcome. Otherwise, I'll try to get to this in a couple of weeks.
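The retry mechanism referenced here is not spelled out in the thread. As an illustration only, the credential-then-anonymous fallback pattern that the read_csv path effectively gets (and read_parquet does not) looks roughly like this; the filesystem class below is a stub standing in for s3fs, not the real API:

```python
class FakeS3:
    """Stand-in for an s3fs filesystem: rejects bad credentials unless anonymous."""
    def __init__(self, anon=False, key=None):
        self.anon = anon
        self.key = key

    def open(self, path):
        # A signed request with bogus credentials is rejected, even for a
        # public object; an unsigned (anonymous) request succeeds.
        if not self.anon and self.key == "BOGUS":
            raise PermissionError("Access Denied")
        return f"handle:{path}"


def open_with_fallback(path, key):
    """Try ambient credentials first; on a permission error, retry anonymously."""
    try:
        return FakeS3(anon=False, key=key).open(path)
    except PermissionError:
        return FakeS3(anon=True).open(path)


# Bogus credentials in the environment no longer break a public read:
print(open_with_fallback("bucket/data.parquet", key="BOGUS"))  # → handle:bucket/data.parquet
```

The point of the sketch is the control flow, not the stub: with this fallback, stale or invalid credentials in the environment don't prevent reading a world-readable object.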
take
Hello, I am a new contributor and just submitted my first PR for this issue (#53895). I added the retry mechanism in the file handling method for PyArrow when read_parquet is invoked, and the above reproducible example no longer gives an OSError. Any feedback would be appreciated!
Hi, is this bug resolved? I am using pandas 2.1.0 and I still have this issue.
Hi, is this bug resolved? In our local testing it seems we still have to pin |
Similar error in
This works around the issue, but it'd be great to fix it. It started breaking after
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I'm trying to use pandas to read a public CSV file and a public parquet file from an s3 bucket. When I have the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set to anything (my valid credentials or some invalid credentials), reading the CSV works, but reading the parquet file fails with an OSError caused by an AWS "ACCESS_DENIED" error. When I remove AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, I can read both the CSV and the parquet file.

Note that in the example above I have set the correct AWS_DEFAULT_REGION, and I have also set AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE to a nonexistent location to illustrate that whether I have credentials in ~/.aws doesn't matter.

Expected Behavior
I should be able to read both the CSV file and the parquet file from s3, whether or not I have any kind of credentials in my environment. Both the files are open to the public. I think pandas is supposed to use the retry mechanism here for this.
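The original reproducible example did not survive extraction from the page. Below is a minimal, non-executing sketch consistent with the description; the bucket and file names are placeholders. `storage_options={"anon": True}` is a documented pandas/fsspec option that forces unsigned requests and is a common workaround for this class of failure:

```python
import os

# Dummy credentials in the environment, as the report describes; any value
# (valid or not) triggers the parquet failure on affected pandas versions.
os.environ["AWS_ACCESS_KEY_ID"] = "BOGUS"
os.environ["AWS_SECRET_ACCESS_KEY"] = "BOGUS"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"          # region is illustrative
os.environ["AWS_CONFIG_FILE"] = "/nonexistent"          # ignore ~/.aws
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "/nonexistent"

# The reads themselves need network access, so they are left commented out:
# import pandas as pd
# pd.read_csv("s3://<public-bucket>/<file>.csv")            # works
# pd.read_parquet("s3://<public-bucket>/<file>.parquet")    # OSError (Access Denied)
#
# Workaround until the fix lands: force unsigned requests for public objects.
# pd.read_parquet("s3://<public-bucket>/<file>.parquet",
#                 storage_options={"anon": True})
```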
Installed Versions
INSTALLED VERSIONS
commit : 0bc16da
python : 3.9.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.0.dev0+977.g0bc16da1e5
numpy : 2.0.0.dev0+84.g828fba29e
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : 7.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.6.0
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None