Towards "pandas 1.0" #10000

jorisvandenbossche · 2015-04-27T12:05:35Z

Here's our roadmap document: https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#

Just because it is a nice round number :-)

Or maybe we can use it to discuss how we imagine a possible pandas 1.0...

Some clarification (from @shoyer): This is not the place to make new feature requests -- please continue to make separate GitHub issues for those. Almost every new feature can be added without a 1.0 release. If there is a change you think would be necessary to do in pandas 1.0, feel free to reference issues where it is described in more detail.

shoyer · 2015-04-27T19:08:58Z

My wish list for pandas 1.0:

Fix []/__getitem__ (Overview of [] (__getitem__) API #9595)
Make the index/column distinction less painful (ENH/API: clarify groupby by to handle columns/index names #5677, Allowing the index to be referenced by name, like a column #8162)

I also have a fantasy world where the pandas Index becomes entirely optional, but that might be too big of a break even for pandas 1.0.

jorisvandenbossche · 2015-04-27T19:30:25Z

I want to add:

Clean up the Index vs MultiIndex API (Unify index and multindex (and possibly others) API #3268)

jnmclarty · 2015-04-27T19:39:50Z

What if every Panel, DataFrame, and Series had a mode that changed the slicing/getitem behavior? One could set the default in the options and change it on a per-object basis when necessary. It could make the transition from old to new smoother, and allow more creative behavior where desired.

shoyer · 2015-04-27T21:16:46Z

@jnmclarty A better option would be some sort of flag that could be set per module, similar to a future statement -- changing the way in which a specific DataFrame is queried is just begging for someone to pass it off to an incompatible function. In fact, I just asked if this is possible on StackOverflow: http://stackoverflow.com/questions/29905278/using-future-style-imports-for-module-specific-features-in-python/

djchou · 2015-04-30T14:52:13Z

It would be nice if there was an option to have boxplot X axis labels match line plot's X axis labels.

shoyer · 2015-04-30T17:09:47Z

@djchou is there an existing issue for that? If not, please make one :).

sinhrks · 2015-05-01T04:13:32Z

Congrats on the great package:D

My wish is:

cythonize internals (Move Block internals code to Cython #163)
Parallelize option for some ops, such as groupby / aggregation

datnamer · 2015-05-01T17:01:38Z

dplyr like macros: https://github.com/dalejung/naginpy

A guy can wish...

TomAugspurger · 2015-05-01T19:30:06Z

I've been working on problems recently where having groupbys run in parallel would have been great (I think). Also maps / applys.

lexual · 2015-06-06T14:24:31Z

ref #1907

toddrjen · 2015-06-09T13:47:09Z

These may be too small, but since this is a wishlist I would like to see some improvements in the consistency of the API. Some examples:

More consistent usage of singular vs. plural, for example index/indexes, column/columns, and level/levels. This includes both the names and whether they accept single values, multiple values, or both.
Make sure the axis argument is available wherever operations are applied along an axis.
Go through related functions and make sure they have the same arguments in the same order. For example, for DataFrame, cumsum has a skipna argument, while diff doesn't.
If an argument does the same thing as a method, it should have the same name as the method. So for example fill_value should be fillna.
Try to get the use of underscores more consistent. For example, in DataFrame we have sort_index and sortlevel, and is_copy, isin, and isnull.

shoyer · 2015-06-09T14:01:01Z

For the record, I'm strongly -1 on @toddrjen's suggestion to rename methods to make the use of underscores more consistent. Even Python 3 didn't clean things up like that.

bwillers · 2015-06-12T11:23:12Z

Integer columns with missing data support :)

xref #8643
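A minimal sketch of the limitation behind this request, under the assumption that the default dtypes are used: one missing value turns an integer column into float64. The second line assumes pandas 0.24 or later, where a nullable `"Int64"` extension dtype is available; it is shown for comparison only and is not necessarily what the linked issue proposes.

```python
import pandas as pd

# With the default dtypes, a missing value forces the integers to float64.
floats = pd.Series([1, 2, None])

# Assumption: pandas >= 0.24, where the nullable "Int64" dtype exists.
ints = pd.Series([1, 2, None], dtype="Int64")

print(floats.dtype)
print(ints.dtype)
```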

benjello · 2015-06-12T11:55:46Z

Allow "statistical" functions like count, sum, mean, quantile, etc. to handle weighted data
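For illustration, this is roughly what a weighted mean looks like when written by hand today. The column names `value` and `weight` are made up for the example, and this is only a sketch of the workaround, not a proposed API.

```python
import pandas as pd

# Hypothetical data: each value carries its own weight.
df = pd.DataFrame({"value": [1.0, 2.0, 4.0], "weight": [1.0, 1.0, 2.0]})

# Weighted mean computed manually: sum(value * weight) / sum(weight).
weighted_mean = (df["value"] * df["weight"]).sum() / df["weight"].sum()

# The built-in mean ignores the weights entirely.
plain_mean = df["value"].mean()
```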

shoyer · 2015-06-12T18:30:48Z

@bwillers I added an xref to an existing issue where that was discussed

@benjello is there already a github issue for adding weights? If not, please make one :).

@bwillers @benjello The good news is that I don't think either of your suggestions require pandas 1.0. Both could be done incrementally.

benjello · 2015-06-13T09:08:28Z

@shoyer #2501 and #10030 are somehow about weights: should I open a new one?

jorisvandenbossche · 2015-06-13T10:09:24Z

@benjello I think we can discuss this further at #10030. That issue is now only about the mean, but it would be good to discuss there which methods we would want to add this functionality to.

tgarc · 2015-07-14T00:10:00Z

I wasn't entirely sure where to put this but I've written up a short gist as an IPython notebook on the current state of MultiIndexing with DataFrame.loc.

https://nbviewer.jupyter.org/gist/tgarc/6c40a65f648302b6b9d7#

What is particularly relevant to this discussion is in the last section. Specifically pandas allows,

df.loc[('foo','bar'), ('one','two'), ('three','four')] (1)

To be taken to mean

df.loc[(('foo','bar'), ('one','two'), ('three','four')), :]

But this type of indexing is ambiguous in the case when the number of indexing tuples is 2 since

df.loc[('foo','bar'), ('one','two')]

could mean incomplete indexing as in

df.loc[(('foo','bar'), ('one','two')),:]

or row,column indexing. Currently, pandas just interprets this as row, column indexing when there are 2 indexing tuples.

My feeling is that the incomplete indexing as in (1) shouldn't be allowed for MultiIndex DataFrames because of the aforementioned ambiguity. I'm not sure whether changing this would break other code and hence whether it should be considered a change that should be held off until v1.0.

This comment and gist is also a summary of some of the discussion that I had with @shoyer and @jonathanrocher at the SciPy sprints.

toddrjen · 2015-07-14T07:59:09Z

This may or may not be a good idea, but it may at least be worth thinking about. Considering that PanelND has always been marked as "experimental" and not all features support it, and considering the work that has been going on in xray, is PanelND something that could be deprecated or dropped for 1.0?

jorisvandenbossche · 2015-07-14T11:46:25Z

@tgarc Nice overview notebook! (by the way, if you would like to submit parts of that to improve the docs, very welcome!)

Part of what you describe is also discussed here (collapsing index levels or not): #10552

As for allowing 'incomplete' indexing on frames, there is already a warning in the docs for this: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers (the red warning box). So it is explicitly "allowed, although warned for because of possible ambiguities" (so not a bug in that sense).
But the question is indeed whether this is a good idea. It is somewhat convenient that it works in the non-ambiguous cases, but it may be better not to allow this. If we want to discuss this in more detail, it is probably better to open a separate issue.

jreback · 2015-07-14T12:41:24Z

@toddrjen

@shoyer and I have had a discussion about this. The proposal is to rename Xray -> pandas-nd. We can discuss further consolidation at some later point. I think we would then deprecate Panelnd (e.g. 4D and higher) and point to pandas-nd. There are a couple of API issues if we also did this for Panel.

Mainly I think we would need some conversions, e.g. to_nd as a mainline function.

jreback · 2015-07-14T12:42:54Z

@tgarc this was added quite a long time ago as a convenience / magic feature. It is specifically warned about and is a limitation of the python syntax.

There are times when it can be detected and other times it is ambiguous. I am not sure that we can do anything about it. If people don't read the docs what can you do.
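The syntax limitation can be seen without pandas at all. The sketch below uses a hypothetical `ShowKey` class (not part of pandas) whose `__getitem__` just returns the key it receives. It illustrates that Python hands `obj[a, b]` and `obj[(a, b)]` to `__getitem__` as the same tuple, so an indexer cannot tell the two spellings apart.

```python
class ShowKey:
    """Hypothetical helper: returns whatever key Python passes to []."""

    def __getitem__(self, key):
        return key


probe = ShowKey()

# Two tuples separated by a comma ...
two_args = probe[('foo', 'bar'), ('one', 'two')]
# ... arrive exactly like one tuple of tuples.
nested = probe[(('foo', 'bar'), ('one', 'two'))]

# The extra parentheses are invisible to __getitem__.
same_key = two_args == nested

# An explicit column slice changes the key that is received, which is why
# the df.loc[(rows...), :] spelling is the unambiguous one.
with_columns = probe[(('foo', 'bar'), ('one', 'two')), :]
```

Only the third form carries a second element (the `:` slice), so only there can rows and columns be separated reliably.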

tgarc · 2015-07-14T20:02:52Z

@jorisvandenbossche Thanks, I'll look to see if there's an appropriate place to add documentation. Thanks for pointing me to that warning - I admit I didn't know it was there.

@jreback I realize that this is an established feature and that there is a warning about it in the docs but as we were discussing pulling back on the complexity of indexing in the future of pandas, modifying this particular feature seemed like a good opportunity to simplify existing code and restrict the number of ways users can do multi-indexing. I'll give this some more thought and potentially open as a new issue.

EDIT: opened as issue #10574

supersedes #11950 xref #10000 Author: Jeff Reback Closes #11972 from jreback/xarray and squashes the following commits: 85de0b7 [Jeff Reback] ENH: add to_xarray conversion method


jorisvandenbossche · 2016-07-27T22:53:33Z

Pinging here on github as well, as I am not sure everybody is aware of the pandas-dev mailing list. But there is currently a thread started by Wes on a pandas 1.0 / future roadmap, and you are certainly welcome to also provide feedback or share ideas.

https://mail.python.org/pipermail/pandas-dev/2016-July/000512.html

cc @chris-b1 @gfyoung @MaximilianR @kawochen @JanSchulz

shoyer · 2016-07-29T17:13:28Z

One other major breaking change to consider:

We should consider making arithmetic between a Series and a DataFrame broadcast across the columns of the dataframe, i.e., aligning series.index with df.index, rather than the current behavior aligning series.index with df.columns.

I think this would be far more useful than the current behavior, because it's much more common to want to do arithmetic between a series and all columns of a DataFrame. This would make broadcasting in pandas inconsistent with NumPy, but I think that's OK for a library that focuses on 1D/2D rather than N-dimensional data structures.
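A small sketch of the difference, using made-up labels: with the plain operator the Series index is matched against the DataFrame's columns, while the existing `axis=0` argument of the flex methods (here `DataFrame.sub`) gives the row-wise alignment described above.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])
s = pd.Series([1, 2, 3], index=["x", "y", "z"])

# Current behaviour: s.index is matched against df.columns.
# No labels overlap here, so every entry of the result is NaN.
by_columns = df - s

# Explicit alternative that exists today: align s.index with df.index.
by_index = df.sub(s, axis=0)
```

The proposal would, in effect, make the second result the default for the `-` operator.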

TomAugspurger · 2016-08-10T18:19:28Z

Some questions for the next couple releases...

Is the idea for 1.0 to stabilize the 0.x API, or to drop a handful of larger API-breaking changes? Or are we pushing the API-breaking changes (e.g. fixing __getitem__) till 2.0?

Actually, that's really my only question. I guess the only followup would be "what falls into that bucket of large API-breaking changes that are actually feasible?"

I think now that 1.0 is upon us, we should refocus this issue from "wishlist" to "stuff that's actually going to happen for 1.0". As we go through issues prepping for 0.19, what's our policy on pushing issues' milestones? Do we push to "1.0" or "Someday"? I'd lean towards "Someday", and only use 1.0 for stuff that's a blocker.

jorisvandenbossche · 2016-08-10T23:05:48Z

> Is the idea for 1.0 to stabilize the 0.x API, or to drop a handful of larger API-breaking changes? Or are we pushing the API-breaking changes (e.g. fixing __getitem__) till 2.0?

As it is now being discussed on the pandas-dev mailing list, I think the conclusion is indeed how you state it here: 1.0 as a stabilization of the current 0.x API, and 2.0 with an internal refactor / larger API changes (e.g. __getitem__).

> we should refocus this issue from "wishlist" to "stuff that's actually going to happen for 1.0"

I think what is discussed in this issue is actually what we are now discussing as changes for 2.0, so I would rather change the milestone, and open a new issue for things we want to do before 1.0.

> As we go through issues prepping for 0.19, what's our policy on pushing issues' milestones? Do we push to "1.0" or "Someday"? I'd lean towards "Someday", and only use 1.0 for stuff that's a blocker.

+1, there is also 'next major release', which has often been used in the past for issues that are no longer included in the current release. But indeed, I would not automatically rename all 'next major release' issues to '1.0', but keep the '1.0' milestone to selectively add to issues that we regard as blockers for 1.0.

jreback · 2016-08-10T23:41:37Z

Here's why I have the tags set this way. We have approx. 1000 issues under next major release. This is really just a placeholder for things to do that are otherwise not categorized as pie-in-the-sky Someday.

The way things have been working is to pull issues off of this to a numbered release. IOW, when someone submits a pull request I mark the issue. Then when the PR is actually merged it gets set with the version number. Otherwise you get a bunch of stale PRs that have version numbers and you then have to go back and manually unassign them.

Same thing with issues. Before I switched to this way (IIRC it was 0.15 or 0.16), I would have to manually go through each one and reassign it (well, I often did it in bulk, but the idea was to review open issues). The 'issue' is that we have a LOT of open issues. They are only semi-prioritized. Prioritizing is quite difficult as resources are not generally available (IOW, there aren't people to 'assign' issues to; rather it's the reverse, people 'assign' them to themselves).

So generally I would assign newish issues to the current version number; as the release gets closer, I would push newer issues to next major release. Then I would still review open issues that have a version number and push / request help.

This activity gets quickie bugs fixed, while allowing some semblance of 'newish' issues (IOW those that happened recently).

Of course if anyone has better suggestions on how to manage issues, speak up!

wesm · 2016-08-10T23:52:11Z

pandas has basically been operating in Kanban style since its beginning. Issues are marked as "on deck" (here: "next major release" -- perhaps we could give this a better name like "approved", "on deck", "fair game" -- some issues may be either pie-in-the-sky or have not yet reached consensus about the path forward) with potentially an additional level of prioritization (e.g. blocker)

It may be a good idea to start thinning down the 1.0 TODO list to things that absolutely must get done. We also need to figure out a procedure for maintaining both a 1.x maintenance branch as well as an unstable 2.0 development branch. I believe that the 2.0 branch can be made to cleanly rebase until the first cut of the internals (libpandas + wrapper classes) stabilizes (which will likely take on the order of months) and can begin to be integrated into pandas/core. At some point a more serious divergence will have to take place, at which point "forward-porting" bug fixes may become complicated.

dragonator4 · 2017-08-13T03:07:35Z

Proper units support would be a good thing for 1.0: #10349. I think @jreback's idea of using the dtype is very organic and awesome.

IMHO, it is OK to break considerable backwards compatibility with a huge release, which in this case would be a culmination of lessons learned, feature additions, etc. There was no way all the current capabilities, and the pending feature requests, bug fixes and enhancements could have been planned for at the time of creation of pandas. Since so much has been bolted on with occasional API changes, as required, there are quite a few inconsistencies in implementation. 1.0 can be a way to organically build up all features from a single trunk. If you need my opinion, I am in favor of libpandas, because I see it as a door to independent development in Python and other languages. You all are better at figuring this out though. Users can always freeze/force older versions in environments to avoid code breakage.

h-vetinari · 2018-11-18T15:11:15Z

Now that there is an actual plan for 1.0 release (i.e. v.0.24 -> v.0.25 -> v.1.0), some of this might be too ambitious, but essentially, those are all about consistency (or lack thereof) that I'd like to see in pandas 1.0:

consistent API for IO methods API: formalize the pandas IO API #15862, API: Unify compression-kwarg for IO-methods #21640
consistent output of groupby.apply: API/DOC: clean up DataFrame.groupby.apply #22545, Inconsistent groupby-apply output shape and random values returned. #20420, BUG: index of group not returned correctly in groupby.apply #22541, API: groupby aggregation with apply does not drop groupby-column #22542, BUG: weird behaviour for returning group in groupby.apply #22546
all about unique: consistent (i.e. pandas can deal with its own types, both as class methods and as pd.unique, and maintains the type of the caller), with possibility to return inverse, API/ENH: overhaul/unify/improve .unique #22824, ENH: adding .unique() to DF (or return_inverse for duplicated) #21357, API: provide a better way of doing np.unique(return_inverses=True) #4087

h-vetinari · 2019-01-30T07:27:13Z

I know that most people can't wait to finally have pandas 1.0, but IMO there are some very fundamental parts of the API that should still stabilize some more:

what exactly can be a label - some possibilities that have popped up in recent discussions:
scalars / scalars or tuples / anything hashable & sortable / anything hashable / ...?
This should be defined, documented, tested and enforced. See Regression in DataFrame.set_index with class instance column keys #24969, DEPR/API: tuples in labels #24688
@toobaz @WillAyd
what exactly is "list-like" - this should likewise be defined, documented, tested and enforced. See API: set should not be considered list_like #23061, is_list_like should return false for tuples #24702
since 0.24, the whole interaction with numpy is starting to change - i.e. using .array instead of .values and explicitly calling .to_numpy() to get an ndarray. This will very likely need some further maturation.

These three points concern some of the most fundamental aspects of the API surface, and leaving them muddy means it will be much harder to fix after 1.0, because many people will be shouting "SemVer!", whether that's the policy or not.
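To make the second and third points concrete, here is a short sketch assuming pandas 0.24 or later. It uses `pandas.api.types.is_list_like` (including its `allow_sets` keyword) and the `.array` / `.to_numpy()` accessors mentioned above; the sample values are arbitrary.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

# What counts as "list-like"? Strings do not; tuples and sets do by default.
checks = {
    "str": is_list_like("abc"),
    "tuple": is_list_like((1, 2)),
    "set": is_list_like({1, 2}),
    "set, allow_sets=False": is_list_like({1, 2}, allow_sets=False),
}

s = pd.Series([1, 2, 3])
arr = s.array      # a pandas ExtensionArray, not a numpy.ndarray
nd = s.to_numpy()  # explicit conversion to a numpy.ndarray

print(checks)
print(type(arr).__name__, type(nd).__name__)
```

That a set and a tuple both pass the default check is exactly the kind of behaviour the two linked issues question.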

Going over the thread, there's also some very good points brought up that have not been addressed yet.

To be sure, there's been a lot of progress (EAs will have a huge impact for good), but even though I'm raining on the parade, I think it's a necessary discussion. At the very least, there needs to be clear communication about what the policy for breaking changes & versioning is going to be post-1.0, for example numpy-style rolling deprecations, similar to the current MO?

I believe that SemVer would either lead to massive ossification, or alternatively, that the current minor releases (like 0.23 -> 0.24) would always have to be major version bumps every ~6 months (which would be a valid choice too), at least for the foreseeable future.

simonjayhawkins · 2019-01-30T11:24:46Z

> would always have to be major version bumps every ~6 months (which would be a valid choice too)

if that was the expectation then would <year>.<month>.<patch> versioning with a 6 month release cycle be more appropriate than semver?

Towards Pandas 20.1 FTW!

rgommers · 2019-02-10T22:27:11Z

Is the Google Doc linked in the description currently the best publicly available Pandas roadmap? Or https://pandas-dev.github.io/pandas2/goals.html#id1? Or is it all so out of date that it's better to state that there currently isn't any roadmap?

TomAugspurger · 2019-02-10T22:38:54Z

https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)#towards-pandas-10 is probably the most up to date, though there are already some inaccurate items.

0.24.0 was just released in January, so 0.25.0 will be a few months from now, and 1.0 sometime in the middle of the year (perhaps at SciPy?)

rgommers · 2019-02-11T00:03:02Z

Thanks @TomAugspurger!

datapythonista · 2019-07-19T16:20:47Z

Probably worth referencing this PR adding a roadmap here: #27478

TomAugspurger · 2019-12-30T14:07:19Z

@jorisvandenbossche is there anything concrete in this issue that isn't recorded elsewhere? We'll need to re-title it soon :)

Is there anything here that's a blocker / nice-to-have for 1.0?

jangorecki · 2020-03-07T04:50:03Z

Shouldn't this issue already be resolved/obsolete given the recent release of pandas 1.0.0?

mroeschke · 2020-04-03T03:51:44Z

Since pandas 1.0 has already been released, I think we are safe to close this issue. We may want to continue discussion on a new "Towards pandas 2.0" issue. Closing for now.

jorisvandenbossche changed the title ~~Our 10,000th issue!~~ Towards "pandas 1.0" May 29, 2015

jorisvandenbossche added this to the 1.0 milestone Jun 8, 2015

jorisvandenbossche mentioned this issue Jun 8, 2015

1.0 Release #1907

Closed

shoyer mentioned this issue Jun 11, 2015

SciPy 2015 conference birds-of-a-feather session #10333

Closed

benjello mentioned this issue Jun 13, 2015

weighted mean #10030

Open

max-sixty mentioned this issue Oct 30, 2015

Partial indexing of a Panel #8906

Closed

jreback added the Community label Dec 30, 2015

toobaz mentioned this issue Feb 14, 2016

BUG: Make dict iterators real iterators, provide "next()" in Python 2 #12299

Closed

jorisvandenbossche mentioned this issue Jul 5, 2016

Deprecation of Panel ? #13563

Closed

shoyer mentioned this issue Sep 26, 2016

Aligning Series.index with DataFrame.index in broadcasting operations wesm/pandas2#30

Open

TomAugspurger mentioned this issue Dec 27, 2016

ENH: New short indexer for operating on values #14976

Closed

ololobus mentioned this issue Jun 22, 2019

Performance drop and MemoryError during insert and _consolidate_inplace #26985

Closed

jorisvandenbossche mentioned this issue Aug 1, 2019

API: Meta-issue for making consistent API's to refer to column names and index names #27652

Open

TomAugspurger modified the milestones: 1.0, Contributions Welcome Jan 2, 2020

mroeschke closed this as completed Apr 3, 2020
