wip-feat: pandas as soft dependency #3384

Closed
wants to merge 10 commits

Conversation

@mattijn
Contributor

mattijn commented Mar 25, 2024

This PR is an attempt to make pandas a soft dependency. I hope it can serve as inspiration, as I was not able to make the types happy. I have no real idea how it should be done, but I've tried a few things, some with success and others without.

I also made an attempt to prioritize the DataFrameLike approach over the pandas routine, but decided against it, as otherwise using a pandas DataFrame within Altair would require pyarrow to infer/serialize. My current feeling is that using pandas to infer and serialize the data is still preferable, as it does not yet depend on pyarrow.

@binste
Contributor

binste commented Mar 29, 2024

Great to get the ball rolling on this, thank you @mattijn! I haven't had time to review yet, but I just wanted to say that I'm happy to have a look at the types once I get to it. As long as the package works, I'm optimistic that we can make mypy happy.

@mattijn
Contributor Author

mattijn commented Mar 29, 2024

Thanks @binste! No rush! Maybe something for version 5.4.

Contributor

@binste left a comment

Just some first comments. I haven't had the chance to run mypy on this PR yet (I reviewed it in the browser), but I have some ideas for how to make it work, which I want to try out depending on the errors it throws.



def import_pandas() -> ModuleType:
    min_version = "0.25"
Contributor

Could you add a comment in pyproject.toml, next to the pandas requirement, noting that if the pandas version is updated it also needs to be changed here? Although I'm realizing now that that file needs to be changed anyway to make pandas optional.

        return curried.pipe(data, data_transformers.get())
    elif isinstance(data, str):
        return {"url": data}
    elif _is_pandas_dataframe(data):
Contributor

Is my understanding correct that this line is only reached if it's an old pandas version that does not support the dataframe interchange protocol? Otherwise it would already stop at line 43, right?

If yes, could you add a comment about this?

@@ -53,6 +52,11 @@ def __dataframe__(
) -> DfiDataFrame: ...


def _is_pandas_dataframe(obj: Any) -> bool:
Contributor

Could this function be a simple isinstance(obj, pd.DataFrame)?

Contributor Author

Thanks for starting to review this PR @binste! I don't think I can do this without importing pandas first.

I tried setting up a function with which I can do some duck typing:

def instance(obj):
    return type(obj).__name__

But I found out that both polars and pandas use the class name DataFrame for their dataframes.

Contributor

Maybe I'm missing something, but couldn't we call the pandas import function you created here? If it raises an ImportError, we know it's not a pandas DataFrame anyway.
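
Something along these lines, as a rough sketch of that idea (reusing the import_pandas helper from this PR):

def _is_pandas_dataframe(obj: Any) -> bool:
    try:
        pd = import_pandas()
    except ImportError:
        # pandas is not installed, so obj cannot be a pandas DataFrame
        return False
    return isinstance(obj, pd.DataFrame)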

Contributor Author

It's pragmatic, I admit. But that would be an unnecessary import of pandas if pandas is available in the environment but the data object is something else.
I wish we could sniff the type without importing modules first.

Contributor

Here's the optional import logic I added to plotly.py a while back: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/_plotly_utils/optional_imports.py. If should_load is False, it won't perform the import even if the library is installed. This was used with isinstance checks, because if pandas hasn't been loaded yet, you know the object you're dealing with isn't a pandas DataFrame, even if pandas is installed.
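
Roughly this pattern, as a simplified sketch (not the exact plotly implementation):

import sys
from importlib import import_module

def get_module(name, should_load=True):
    """Return the module if it is already loaded (or can be loaded), else None."""
    if name in sys.modules:
        return sys.modules[name]
    if not should_load:
        # Never imported so far and we don't want to trigger the import:
        # nothing we're inspecting can be an instance of its classes.
        return None
    try:
        return import_module(name)
    except ImportError:
        return None

def is_pandas_dataframe(obj):
    # Cheap isinstance check that never forces a pandas import.
    pd = get_module("pandas", should_load=False)
    return pd is not None and isinstance(obj, pd.DataFrame)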

Contributor

@MarcoGorelli Jun 25, 2024

A trick I learned from scikit-learn is to check if pandas is in sys.modules before doing the isinstance check, something like

if (pd := sys.modules.get('pandas')) is not None and isinstance(df, pd.DataFrame):
    ...

If pandas was never imported, then df is definitely not pandas

(this is also what we do in Narwhals, where pandas/polars/etc. are never explicitly imported)

Contributor Author

I just saw your response here @MarcoGorelli! I also made this observation recently; see the comment I just added in #3384 (comment)...

        return pd
    except ImportError as err:
        raise ImportError(
            f"Serialization of the DataFrame requires\n"
Contributor

Suggested change:
- f"Serialization of the DataFrame requires\n"
+ f"Serialization of this data requires\n"

It can also be a dict, as in data.py's _data_to_csv_string. Furthermore, if it's a dataframe, it's already a given that pandas is installed.

Comment on lines +47 to +49
if TYPE_CHECKING:
    pass

Contributor

Suggested change:
- if TYPE_CHECKING:
-     pass

I'm aware that it's just a WIP PR, but I thought I'd note it anyway :)

Comment on lines +51 to +53
class _PandasTimestamp:
    def isoformat(self):
        return "dummy_isoformat"  # Return a dummy ISO format string
Contributor

I think this should inherit from Protocol, as a pd.Timestamp is not an instance of _PandasTimestamp. You'll then also need to add the @runtime_checkable decorator from typing. Also, we could directly test for a pandas timestamp in a function similar to _is_pandas_dataframe, to keep these approaches consistent?
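
Something like this, as a sketch (the helper name is just illustrative, and the only attribute duck-typed on is isoformat, so other datetime-like objects would also match):

from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class _PandasTimestamp(Protocol):
    # Structural type: anything with an isoformat() method matches,
    # including pd.Timestamp, without importing pandas.
    def isoformat(self) -> str: ...

def _is_pandas_timestamp_like(obj: Any) -> bool:
    # Note: runtime_checkable only checks that the method exists,
    # not its signature, so this check is deliberately loose.
    return isinstance(obj, _PandasTimestamp)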

@@ -4,11 +4,11 @@

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype
Contributor

Let's make the tests also run without pandas installed, so that we can run the whole test suite once with pandas installed and once without. That prevents us from accidentally reintroducing a hard dependency in the future.
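
For example, a sketch of how pandas-only tests could be guarded (the test itself is just illustrative; the exact mechanism is still to be decided):

import importlib.util

import pytest

# Skip an entire test module when pandas is missing ...
pd = pytest.importorskip("pandas")

# ... or skip individual tests with a reusable marker.
requires_pandas = pytest.mark.skipif(
    importlib.util.find_spec("pandas") is None,
    reason="pandas is not installed",
)

@requires_pandas
def test_dataframe_roundtrip():
    df = pd.DataFrame({"x": [1, 2, 3]})
    assert list(df["x"]) == [1, 2, 3]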

@dangotbanned
Member

dangotbanned commented Jun 25, 2024

@mattijn just throwing this in as a suggestion, have you considered narwhals?

narwhals is quite new but seems promising:

  • The author (@MarcoGorelli) is a maintainer for both pandas and polars
  • It has zero dependencies
  • Uses a single API, which could potentially simplify a lot of altair compatibility code
    • It could be worthwhile to review what they've implemented so far, and to what extent this covers altair's use case
    • I'm likely biased towards it as I'm a big fan of the polars API it is based upon

Even if you were not to go down this route, they have collected a range of issues/PRs in narwhals-dev/narwhals#62 from projects interested in the same topic as this PR, which could prove to be a great resource regardless.

Side note:
I was initially thinking narwhals could help with #3213 (comment), as you could use nw.col - but AFAIK key-completions aren't in there yet.


@MarcoGorelli
Contributor

MarcoGorelli commented Jun 25, 2024

Thanks @mattijn @dangotbanned for the ping! Indeed, this is exactly the kind of use case Narwhals is designed for. Happy to help out if there's any interest; any feature request is considered in scope if it can benefit a project of Altair's caliber!

And if not, no worries, I'm still happy to see that you're going down this route, thanks for all your work here 🙌

@dangotbanned
Member

Thanks @mattijn for the ping!

@mattijn if you arrive here confused, I was the one who summoned @MarcoGorelli 😄

@mattijn
Contributor Author

mattijn commented Jun 25, 2024

Interesting! Thanks for sharing! This is possibly interesting both for type inference of columns and for serialization of the dataframe?

See also related historical issues/PRs:

I think @jonmmease is also better able to judge whether this is of interest for altair. What do you think?

@dangotbanned
Member

dangotbanned commented Jun 26, 2024

@mattijn

Interesting! Thanks for sharing!

No problem

This is possibly both interesting for type inference of columns

Note: While I was writing this up, @MarcoGorelli opened #3445, so I'm not covering the pyarrow part here.

Original:

From my understanding so far, part of this would be solved with translate_dtype.
However, that only covers the case where the d/type is known.

altair/altair/utils/core.py, lines 600 to 651 in 62ab14d:

    # if data is specified and type is not, infer type from data
    if "type" not in attrs:
        if pyarrow_available() and data is not None and isinstance(data, DataFrameLike):
            dfi = data.__dataframe__()
            if "field" in attrs:
                unescaped_field = attrs["field"].replace("\\", "")
                if unescaped_field in dfi.column_names():
                    column = dfi.get_column_by_name(unescaped_field)
                    try:
                        attrs["type"] = infer_vegalite_type_for_dfi_column(column)
                    except (NotImplementedError, AttributeError, ValueError):
                        # Fall back to pandas-based inference.
                        # Note: The AttributeError catch is a workaround for
                        # https://github.com/pandas-dev/pandas/issues/55332
                        if _is_pandas_dataframe(data):
                            attrs["type"] = infer_vegalite_type(data[unescaped_field])
                        else:
                            raise
                    if isinstance(attrs["type"], tuple):
                        attrs["sort"] = attrs["type"][1]
                        attrs["type"] = attrs["type"][0]
        elif _is_pandas_dataframe(data):
            # Fallback if pyarrow is not installed or if pandas is older than 1.5
            #
            # Remove escape sequences so that types can be inferred for columns with special characters
            if "field" in attrs and attrs["field"].replace("\\", "") in data.columns:
                attrs["type"] = infer_vegalite_type(
                    data[attrs["field"].replace("\\", "")]
                )
                # ordered categorical dataframe columns return the type and sort order as a tuple
                if isinstance(attrs["type"], tuple):
                    attrs["sort"] = attrs["type"][1]
                    attrs["type"] = attrs["type"][0]
    # If an unescaped colon is still present, it's often due to an incorrect data type specification
    # but could also be due to using a column name with ":" in it.
    if (
        "field" in attrs
        and ":" in attrs["field"]
        and attrs["field"][attrs["field"].rfind(":") - 1] != "\\"
    ):
        raise ValueError(
            '"{}" '.format(attrs["field"].split(":")[-1])
            + "is not one of the valid encoding data types: {}.".format(
                ", ".join(TYPECODE_MAP.values())
            )
            + "\nFor more details, see https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types. "
            + "If you are trying to use a column name that contains a colon, "
            + 'prefix it with a backslash; for example "column\\:name" instead of "column:name".'
        )
    return attrs

The infer_vegalite_type cases above depend on infer_dtype, a pandas C-extension function.

narwhals has maybe_convert_dtypes, which wraps pandas.NDFrame.convert_dtypes or is a no-op.

@MarcoGorelli was this restriction intentional?

import narwhals
import pandas

pandas.DataFrame.convert_dtypes
pandas.Series.convert_dtypes
narwhals.maybe_convert_dtypes # seems to only apply for DataFrame
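
For context, a tiny illustration of what convert_dtypes does on the pandas side (inferring nullable/extension dtypes for object or NaN-padded columns):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", None]})
print(df.dtypes)                   # a: float64, b: object
print(df.convert_dtypes().dtypes)  # a: Int64,   b: string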

These altair tests seem to only cover list inputs, which would roughly approximate a pd.Series:

@pytest.mark.parametrize(
    "value,expected_type",
    [
        ([1, 2, 3], "integer"),
        ([1.0, 2.0, 3.0], "floating"),
        ([1, 2.0, 3], "mixed-integer-float"),
        (["a", "b", "c"], "string"),
        (["a", "b", np.nan], "mixed"),
    ],
)
def test_infer_dtype(value, expected_type):
    assert infer_dtype(value, skipna=False) == expected_type

Overall, these seem like minor, solvable issues to me

@MarcoGorelli
Contributor

@MarcoGorelli was this restriction intentional?

as in, the restriction of maybe_convert_dtypes to DataFrame? No reason, we could (and should!) do it for Series too

However that only covers the case where the d/type is known.

could you clarify please? when is the dtype not known?

Perhaps we should have a separate thread to discuss this so as to not risk losing focus on this PR too much. I think Narwhals support is related but orthogonal, and that the simplest way to go about things might be:

  1. get this working without Narwhals (as per this PR)
  2. once it's working, evaluate whether Narwhals can help keep complexity down / simplify maintenance

@dangotbanned
Member

dangotbanned commented Jun 26, 2024

@MarcoGorelli was this restriction intentional?

as in, the restriction of maybe_convert_dtypes to DataFrame? No reason, we could (and should!) do it for Series too

Yeah, that was what I meant.
I wasn't sure whether I had spotted an easy future extension to narwhals, or whether this was considered but rejected during the implementation of maybe_convert_dtypes for some reason I couldn't see from my brief look.

However that only covers the case where the d/type is known.

could you clarify please? when is the dtype not known?

Apologies, maybe it makes more sense when looking at the code just before the block I linked.
That function takes a shorthand parameter, which can be a column name with potentially additional information (see Encoding Shorthands).
This is used in combination with any metadata provided by data, and in theory it must work for a generic dataframe with or without datatypes present.

Perhaps we should have a separate thread to discuss this so as to not risk losing focus on this PR too much. I think Narwhals support is related but orthogonal, and that the simplest way to go about things might be:

1. get this working without Narwhals (as per this PR)

2. once it's working, evaluate whether Narwhals can help keep complexity down / simplify maintenance

That would be fine with me, @mattijn what are your thoughts on this plan?

@mattijn
Contributor Author

mattijn commented Jun 26, 2024

It makes sense to open a new issue with the suggestion of utilizing narwhals. Thanks!

Regarding this issue, the recent work within vegafusion to make imports lazy might also be of interest here. See vega/vegafusion#491.

Especially an approach such as this:

pd = sys.modules.get("pandas", None)
pl = sys.modules.get("polars", None)

if pd is not None and isinstance(value, pd.DataFrame):
    ...
if pl is not None and isinstance(value, pl.DataFrame):
    ...

@binste
Contributor

binste commented Jun 27, 2024

Great to see all the activity on this topic and thanks to everyone chiming in! :) Regarding narwhals, I'm not sure how it relates to https://github.com/data-apis/dataframe-api, but as mentioned by others, it's best to continue this discussion separately and first remove pandas as a hard dependency.

The approach of scikit-learn/vegafusion with sys.modules looks efficient to me, so I'd be in favor of adopting it! @mattijn, how would you like to proceed here? Would you prefer a more detailed review, or are there open items you first want to implement, such as switching to sys.modules?

@mattijn
Contributor Author

mattijn commented Jul 15, 2024

Superseded by #3452

@mattijn mattijn closed this Jul 15, 2024
@jonmmease
Contributor

Superseded by #3452

Thanks for getting the ball rolling @mattijn!
