Decimal fields dropped in group by with more than one column #22275

drudd · 2018-08-10T22:31:34Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import decimal

df = pd.DataFrame({'a': [decimal.Decimal('4.56')]*6, 'b': range(3, 6)*2, 'c': range(6)})
print df.groupby('b')['a'].sum()
print df.groupby('b')['a', 'c'].sum()
print df.groupby('b').agg({'a': 'sum', 'c': 'sum'})

Output:

b
3    9.12
4    9.12
5    9.12
Name: a, dtype: object
   c
b
3  3
4  5
5  7
      a  c
b
3  9.12  3
4  9.12  5
5  9.12  7

Problem description

The aggregation over column a is dropped when another field is accessed from the groupby object, but works when requested through agg (I'm not sure if 'sum' is exactly equivalent to .sum() in the above).

Expected Output

b
3    9.12
4    9.12
5    9.12
Name: a, dtype: object
      a  c
b
3  9.12  3
4  9.12  5
5  9.12  7
      a  c
b
3  9.12  3
4  9.12  5
5  9.12  7

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.8.0
xarray: None
IPython: 5.7.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.4
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

drudd · 2018-08-10T23:14:43Z

Tracing through, the 'a' column appears to be dropped by this section of NDFrameGroupBy. _cython_agg_blocks:

        if numeric_only:
            data = data.get_numeric_data(copy=False)

Specifically this line:

pandas/pandas/core/groupby/groupby.py

Line 4006 in 0409521

data = data.get_numeric_data(copy=False)

This suggests to me that there's a mismatch with the logic that sets numeric_only (True), and .get_numeric_data which skips the Decimal field, encoded as objects.

WillAyd · 2018-08-10T23:26:09Z

There is no native support for Decimal types in NumPy and therefore it is also generally weakly supported in pandas. In theory a community-contributed Extension would help improve that.

In the meantime though I do agree that it's strange that there is a difference between the Series and DataFrame GroupBy mechanism's. Investigation and PRs to align that behavior are certainly welcome

drudd · 2018-08-10T23:35:15Z

My above comment appears to be correct, numeric_only = True is set for sum when that groupby function is created:

pandas/pandas/core/groupby/groupby.py

Line 1466 in 0409521

cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)

Which can be seen by using a non-numeric_only group function such as min:

df.groupby('b')['a', 'c'].min()

produces

Not knowing this api very closely I'm surprised that an exception isn't raised earlier in the process if one of the fields is non-numeric (and would be happy if it did so rather than silently dropping data).

drudd · 2018-08-10T23:40:06Z

Another tactic would be to remove the get_numeric_data call and rely on the try/except later in the loop which should get raised if there is no appropriate aggregation function.

I have tested this and it restores consistency between the two methods. Happy to PR if we can be confident this won't break other logic.

WillAyd · 2018-08-11T00:27:53Z

If you have something that works and doesn't break other tests locally feel free to submit as a PR

drudd · 2018-08-11T01:06:56Z

Unfortunately it causes a series of test failures, so the solution isn't quite so simple.

drudd · 2018-08-11T02:45:40Z

This bug is similar to #21664 and I think the failing tests may be expecting
http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns

jschendel · 2018-08-12T01:04:48Z

There is no native support for Decimal types in NumPy and therefore it is also generally weakly supported in pandas. In theory a community-contributed Extension would help improve that.

As a side note, we actually have this within our test suite, albeit in a fledgling state:

pandas/pandas/tests/extension/decimal/array.py

Lines 37 to 118 in 0370740

    
           class DecimalArray(ExtensionArray, ExtensionScalarOpsMixin): 
        
               dtype = DecimalDtype() 
        
               def __init__(self, values, dtype=None, copy=False): 
        
                   for val in values: 
        
                       if not isinstance(val, self.dtype.type): 
        
                           raise TypeError("All values must be of type " + 
        
                                           str(self.dtype.type)) 
        
                   values = np.asarray(values, dtype=object) 
        
                   self._data = values 
        
                   # Some aliases for common attribute names to ensure pandas supports 
        
                   # these 
        
                   self._items = self.data = self._data 
        
                   # those aliases are currently not working due to assumptions 
        
                   # in internal code (GH-20735) 
        
                   # self._values = self.values = self.data 
        
               @classmethod 
        
               def _from_sequence(cls, scalars, dtype=None, copy=False): 
        
                   return cls(scalars) 
        
               @classmethod 
        
               def _from_factorized(cls, values, original): 
        
                   return cls(values) 
        
               def __getitem__(self, item): 
        
                   if isinstance(item, numbers.Integral): 
        
                       return self._data[item] 
        
                   else: 
        
                       return type(self)(self._data[item]) 
        
               def take(self, indexer, allow_fill=False, fill_value=None): 
        
                   from pandas.api.extensions import take 
        
                   data = self._data 
        
                   if allow_fill and fill_value is None: 
        
                       fill_value = self.dtype.na_value 
        
                   result = take(data, indexer, fill_value=fill_value, 
        
                                 allow_fill=allow_fill) 
        
                   return self._from_sequence(result) 
        
               def copy(self, deep=False): 
        
                   if deep: 
        
                       return type(self)(self._data.copy()) 
        
                   return type(self)(self) 
        
               def __setitem__(self, key, value): 
        
                   if pd.api.types.is_list_like(value): 
        
                       value = [decimal.Decimal(v) for v in value] 
        
                   else: 
        
                       value = decimal.Decimal(value) 
        
                   self._data[key] = value 
        
               def __len__(self): 
        
                   return len(self._data) 
        
               def __repr__(self): 
        
                   return 'DecimalArray({!r})'.format(self._data) 
        
               @property 
        
               def nbytes(self): 
        
                   n = len(self) 
        
                   if n: 
        
                       return n * sys.getsizeof(self[0]) 
        
                   return 0 
        
               def isna(self): 
        
                   return np.array([x.is_nan() for x in self._data], dtype=bool) 
        
               @property 
        
               def _na_value(self): 
        
                   return decimal.Decimal('NaN') 
        
               @classmethod 
        
               def _concat_same_type(cls, to_concat): 
        
                   return cls(np.concatenate([x._data for x in to_concat])) 
        
           DecimalArray._add_arithmetic_ops() 
        
           DecimalArray._add_comparison_ops()

This doesn't quite get the job done though, as there's currently no way to dynamically alter ExtensionBlock.is_numeric, which currently is always False, and is what get_numeric_data is ultimately looking at for DecimalArray.

Will open a separate issue for the above though, as it's only tangentially related to this issue, and using DecimalArray would really only resolve this issue for the Decimal case (i.e. a more generic solution for object dtype might be nice?).

jreback · 2018-09-19T15:31:34Z

I think this will just work after #22762 (if constructed as a DecimalArray)

adisunw · 2019-09-03T22:28:42Z

Any fix for this? Just ran into this same problem.
Edit: Constructing series as a DecimalArray still does not solve.

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 18.6.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

rhshadrach · 2023-07-15T17:48:58Z

This was resolved by #46560

WillAyd added this to the Contributions Welcome milestone Aug 11, 2018

WillAyd added the Groupby label Aug 11, 2018

jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 19, 2018

mroeschke added the Bug label Jun 21, 2021

ikramersh mentioned this issue Nov 7, 2021

Fixes Issues#44132, #40148, #29033, #22275, #18869: groupby #44142

Closed

8 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach closed this as completed Jul 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decimal fields dropped in group by with more than one column #22275

Decimal fields dropped in group by with more than one column #22275

drudd commented Aug 10, 2018

drudd commented Aug 10, 2018

WillAyd commented Aug 10, 2018

drudd commented Aug 10, 2018

drudd commented Aug 10, 2018

WillAyd commented Aug 11, 2018

drudd commented Aug 11, 2018

drudd commented Aug 11, 2018

jschendel commented Aug 12, 2018

jreback commented Sep 19, 2018

adisunw commented Sep 3, 2019 •

edited

Loading

rhshadrach commented Jul 15, 2023

Decimal fields dropped in group by with more than one column #22275

Decimal fields dropped in group by with more than one column #22275

Comments

drudd commented Aug 10, 2018

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

drudd commented Aug 10, 2018

WillAyd commented Aug 10, 2018

drudd commented Aug 10, 2018

drudd commented Aug 10, 2018

WillAyd commented Aug 11, 2018

drudd commented Aug 11, 2018

drudd commented Aug 11, 2018

jschendel commented Aug 12, 2018

jreback commented Sep 19, 2018

adisunw commented Sep 3, 2019 • edited Loading

INSTALLED VERSIONS

rhshadrach commented Jul 15, 2023

Output of `pd.show_versions()`

adisunw commented Sep 3, 2019 •

edited

Loading