[Enh]: Ability to use nw.col / Expr for .group_by #1385

Open
srivarra opened this issue Nov 16, 2024 · 2 comments

Comments

@srivarra
Contributor

srivarra commented Nov 16, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

I'm writing a groupby method for annsel that can group by one or both of the following DataFrames in the AnnData object: Variables (var) and Observations (obs).

Please describe the purpose of the new feature or describe the problem to solve.

I currently have both filtering and selecting implemented. Each uses a class tied to the DataFrame the query is applied to, and these classes wrap and return nw.col Expressions. That works for filter and select, but group_by only accepts strings. It would be nice if I could use nw.col(*names) there as well.

For example:

import narwhals as nw
from narwhals.group_by import GroupBy as NwGroupby
from narwhals.typing import IntoDataFrame, IntoExpr

@nw.narwhalify
def _groupby_observation_df(df: IntoDataFrame, expr: IntoExpr) -> NwGroupby:
    return df.group_by(expr)


_groupby_observation_df(obs, nw.col("Cluster_ID"))

gives me the following error: TypeError: 'Expr' object is not callable

This would help me keep my internal API consistent.

Suggest a solution if possible.

Maybe a special case for nw.col? Or perhaps a way to make it hashable?

Another option might be to make nw.col an instance of its own class Col, as Polars does.

If you have tried alternatives, please describe them below.

The alternative that I've tried is to strictly use strings, which works, but isn't ideal.

Additional information that may help us understand your needs.

Here is some additional context to my workflow.

from collections.abc import Callable, Iterable
from typing import Any
import narwhals as nw
from narwhals.utils import flatten

def _with_names(func: Callable) -> Callable:
    def wrapper(plx: Any, *names: str | Iterable[str]) -> Any:
        return plx.col(*flatten(names))

    return wrapper


@_with_names
def _func(plx: Any, *names: str | Iterable[str]) -> Any:
    return plx.col(*names)


class ObsExpr(nw.Expr):
    """An Obs DataFrame wrapper for the `narwhals.Expr` class."""

    def __init__(self, call: Callable[[Any], Any]) -> None:
        super().__init__(call)


class ObsCol:
    """Select columns from the :obj:`~anndata.AnnData.obs` DataFrame of an :obj:`~anndata.AnnData` object."""

    def __call__(self, *names: str | Iterable[str]) -> ObsExpr:
        """Select columns from the :obj:`~anndata.AnnData.obs` DataFrame of an :obj:`~anndata.AnnData` object.

        This is a wrapper around the `narwhals.col` function.

        Parameters
        ----------
        names
            The names of the obs columns to select.

        Returns
        -------
        A `narwhals.Expr` object representing the selected columns.
        """
        return ObsExpr(lambda plx: _func(plx, *names))

I then run a match-case against the Expr subtype and collect the expressions for var, obs, and the other DataFrames within AnnData. Depending on the operation, each group is executed on its respective DataFrame.

obs_col = ObsCol()

exprs = [obs_col(["Cluster_ID"])]


for expr in exprs:
    # Match on the Expr subclass that obs_col / var_col return.
    match expr:
        case ObsExpr(): ...  # run the obs_col() expression on the Obs DataFrame
        case VarExpr(): ...  # run the var_col() expression on the Var DataFrame
@MarcoGorelli
Member

Thanks for the request!

Just to make sure I've understood - you want to be able to do df.group_by(nw.col('cluster_id')) instead of df.group_by('cluster_id')? Could you clarify why please?

@srivarra
Contributor Author

Just to make sure I've understood - you want to be able to do df.group_by(nw.col('cluster_id')) instead of df.group_by('cluster_id')? Could you clarify why please?

Yes that is correct.

I'd like to be able to do this so that I can differentiate the various queries a user performs on an AnnData structure, which you can think of as multiple DataFrames stuck together. Currently I have different "columns" associated with each DataFrame. For example, a column for the Variables DataFrame is a callable class, VarCol, which returns its own VarExpr Expression subclass. This implementation lets the user dictate which queries run on which DataFrames within AnnData.

These classes wrap nw.col as a callable class (returning an Expression subclass built from the flattened names), which has made it easy to split operations across the multiple DataFrames within an AnnData object based on the Expression class (one per type of column). My inspiration was Polars' col, which implements Col as a callable class and then instantiates it for users. In my implementation, a user doesn't use VarCol directly; they use the callable object var_col = VarCol() and pass in column names, like so: var_col(*names). If I just use nw.col, my API doesn't know which DataFrame to apply that general Expression to.

Other Narwhals methods such as DataFrame.filter and DataFrame.select already work with nw.col, which lets me apply those methods across the various DataFrames in AnnData and has made organizing user-provided operations easier. Polars' implementation of group_by also accepts Expressions as input (e.g. pl.col).

If I just use strings, I'd have to write separate logic for group_by that differs from what I use for methods such as select and filter.
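
To show what that separate logic might look like, here is a hedged sketch of a string-based stopgap. `ObsExpr`, `VarExpr`, `TARGETS`, and `unwrap_group_keys` are hypothetical stand-ins, and it assumes the tagged expressions keep their column names around, which a real nw.Expr subclass may not do.

```python
class ObsExpr:
    """Stand-in obs-tagged expression that remembers its column names."""

    def __init__(self, *names: str) -> None:
        self.names = names


class VarExpr:
    """Stand-in var-tagged expression that remembers its column names."""

    def __init__(self, *names: str) -> None:
        self.names = names


# Maps each tagged expression class to the DataFrame it targets.
TARGETS = {ObsExpr: "obs", VarExpr: "var"}


def unwrap_group_keys(*exprs):
    """Return (target, column names) for use with a string-only group_by."""
    kinds = {type(e) for e in exprs}
    if len(kinds) != 1:
        raise ValueError("group keys must all target the same DataFrame")
    (kind,) = kinds
    if kind not in TARGETS:
        raise TypeError(f"unsupported group key type: {kind.__name__}")
    names = [name for e in exprs for name in e.names]
    return TARGETS[kind], names


target, keys = unwrap_group_keys(ObsExpr("Cluster_ID"), ObsExpr("Cell_label"))
# `keys` could then be passed on as plain strings, e.g. df.group_by(*keys).
```

This is exactly the extra code path that would be unnecessary if group_by accepted expressions the way filter and select do.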

Here's an example of what this lets me do:

import annsel as an

# assume we have some AnnData object

adata.an.filter(
    an.obs_col(["Cell_label"]) == "Lymphomyeloid prog",
    an.var_col(["feature_type"]).is_in(["protein_coding", "lncRNA"]),
)

Here I am filtering the Observation DataFrame based on Cell_label, and then filtering the Variables DataFrame based on feature_type.

Because obs_col and var_col return Expr subclasses ObsExpr and VarExpr respectively, I can take advantage of narwhals' functionality for the two DataFrames by pattern matching on them and passing the respective Expressions to a Narwhals filter method.

I hope this helps clarify why this feature would be useful for me.


2 participants