
BUG: large datetimes incorrect in read_sas #56014

Open
jbrockmendel opened this issue Nov 16, 2023 · 4 comments
Labels: Bug, IO SAS, SAS: read_sas

Comments

jbrockmendel (Member) commented Nov 16, 2023

There are a few tests in tests.io.sas.test_sas7bdat that need attention from someone with access to SAS.

def test_max_sas_date(datapath):
    # GH 20927
    # NB. max datetime in SAS dataset is 31DEC9999:23:59:59.999
    #    but this is read as 29DEC9999:23:59:59.998993 by a buggy
    #    sas7bdat module

We need to read the file "pandas/tests/io/sas/data/max_sas_date.sas7bdat" with SAS and inspect the first entry in the dt_as_dt column to confirm the expected behavior. It is also unclear what precision the result should have: the comment suggests milliseconds, we get microseconds, and pyreadstat rounds to seconds (xref WizardMac/ReadStat#306).

def test_date_time(datapath):
    [...]
    # GH 19732: Timestamps imported from sas will incur floating point errors

Similarly we need to read the file "pandas/tests/io/sas/data/datetime.sas7bdat" with SAS and inspect the first and last entries in the "DateTimeHi" column to ensure we are getting these right.

jbrockmendel added the Bug, Needs Triage, Testing, IO SAS, SAS: read_sas labels and removed the Bug, Needs Triage labels on Nov 16, 2023
ChadFulton commented Nov 30, 2023

Opening max_sas_date.sas7bdat in "SAS 9.4 TS Level 1M6" in Linux yields the following table (in "SAS:VIEWTABLE" in "Browse" mode):

   text    dt_as_float       dt_as_dt                date_as_float  date_as_date
1  max     253717747199.999  31DEC9999:23:59:59.999  2936547        9999-12-31
2  normal  1880323199.999    01AUG2019:23:59:59.999  21762          2019-08-01

ChadFulton commented

And the datetime.sas7bdat shows:

   Date1       Date2    DateTime            DateTimeHi                    Taiw
1  1677-09-22  22SEP77  21SEP1677:00:12:44  21SEP1677:00:12:43.145225525  0001/01/01
2  1960-01-01  01JAN60  01JAN1960:00:00:00  01JAN1960:00:00:00.000000000  0049/01/01
3  2016-02-29  29FEB16  29FEB2016:23:59:59  29FEB2016:23:59:59.123456001  0105/02/29
4  2262-04-11  11APR62  11APR2262:23:47:16  11APR2262:23:47:16.854774475  0351/04/11
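Rows 1 and 4 are not arbitrary: their DateTimeHi values sit within a few nanoseconds of the bounds of pandas' int64, nanosecond-resolution Timestamp, which is why these rows make useful edge-case fixtures:

```python
import pandas as pd

# Limits of the int64-nanosecond Timestamp used by pandas
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```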

jbrockmendel (Member, Author) commented

Thanks @ChadFulton, I'll update the comments in the file to point back here as the source of truth.

jbrockmendel changed the title from TST: verify "expected" datetimes in SAS tests to BUG: large datetimes incorrect in read_sas on Dec 17, 2023
jbrockmendel added the Bug label and removed the Testing label on Dec 17, 2023
gtumuluri commented Oct 3, 2024

I am also hitting this error: read_sas simply fails outright, leaving no way to read the file and salvage the good data.

>>> df = pd.read_sas('my_sas_file.sas7bdat')

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 83, in _convert_datetimes
    return pd.to_datetime(sas_datetimes, unit=unit, origin="1960-01-01")
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 1068, in to_datetime
    values = convert_listlike(arg._values, format)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 393, in _convert_listlike_datetimes
    return _to_datetime_with_unit(arg, unit, name, tz, errors)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 557, in _to_datetime_with_unit
    arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
  File "pandas/_libs/tslib.pyx", line 312, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 's'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sasreader.py", line 175, in read_sas
    return reader.read()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
    rslt = self._chunk_to_dataframe()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 792, in _chunk_to_dataframe
    rslt[name] = _convert_datetimes(rslt[name], "s")
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 85, in _convert_datetimes
    s_series = sas_datetimes.apply(_parse_datetime, unit=unit)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 142, in f
    return func(x, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 55, in _parse_datetime
    return datetime(1960, 1, 1) + timedelta(seconds=sas_datetime)
OverflowError: days=-1176508800; must have magnitude <= 999999999
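The final OverflowError here is a hard limit of the stdlib's datetime.timedelta, not of pandas: the fallback path in _convert_datetimes builds a timedelta per value, and anything past ±999,999,999 days raises. A quick illustration:

```python
from datetime import timedelta

# timedelta is capped at +/- 999,999,999 days in either direction
print(timedelta.min.days, timedelta.max.days)  # -999999999 999999999

# The value from the traceback above is far outside that range
try:
    timedelta(days=-1176508800)
except OverflowError as exc:
    print(exc)
```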

Reading the same file in R using the haven package works just fine and actually shows the bad timestamps clearly.

> library(haven)
> df <- read_sas('my_sas_file.sas7bdat')
> summary(df$DtObgnOrig)
                          Min.                        1st Qu. 
"-3219212-04-24 00:00:00.0000"     "2004-01-27 00:00:00.0000" 
                        Median                           Mean 
    "2008-08-11 00:00:00.0000"     "2009-03-08 10:15:16.7828" 
                       3rd Qu.                           Max. 
    "2014-12-09 00:00:00.0000"     "2027-06-12 00:00:00.0000" 
                          NA's 
                    "93215826" 
