Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make all imputation methods consistent in regard to encoding requirements #824

Open
nicolassidoux opened this issue Nov 13, 2024 · 6 comments · May be fixed by #827
Open

Make all imputation methods consistent in regard to encoding requirements #824

nicolassidoux opened this issue Nov 13, 2024 · 6 comments · May be fixed by #827
Assignees
Labels
enhancement New feature or request

Comments

@nicolassidoux
Copy link
Collaborator

Description of feature

Imputation requires proper encoding of the data, but at the moment, there is no consistent strategy about what to do if a passed AnnData is not encoded. Some methods throw an exception, while some others silently encode the data with an arbitrary method.

It would be probably better to just throw an exception and let the caller decide what to do instead of altering the data.

Function Encoding required Action if not encoded
explicit_impute No None
simple_impute Strategy-dependant Raise ValueError
knn_impute Yes Perform ordinal encoding
miss_forest_impute Yes Perform ordinal encoding
mice_forest_impute Yes Perform ordinal encoding

So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.

@nicolassidoux nicolassidoux added the enhancement New feature or request label Nov 13, 2024
@nicolassidoux nicolassidoux self-assigned this Nov 13, 2024
@nicolassidoux
Copy link
Collaborator Author

Comments so far (note: I haven’t tested anything yet)

  1. Imports from protected

    • I noticed that prepocessing/_imputation.py contains imports from protected modules or functions:

      from ehrapy.anndata.anndata_ext import _get_column_indices  
      • _get_column_indices is primarily used outside its module and could be valuable to the public API. Since the documentation is already in place, I suggest making it public.
      from ehrapy.core._tool_available import _check_module_importable  
      • _check_module_importable is only used (outside its module) in prepocessing/_imputation.py to check for the presence of sklearnex. If conditional execution based on this is truly necessary, I propose making _check_module_importable public across the entire ehrapy module.

      • For consistency, I also recommend making _shell_command_accessible module-public. To align naming conventions, I renamed _tool_available to _utils_available to match _utils_doc (previously _doc_util).

  2. Return Value Behavior

    • Some functions return an AnnData object only if the copy argument is set. I believe this is a mistake. The hints specifices the functions should consistently return the AnnData object passed to them, regardless of whether it’s copied at the start.
  3. Use of Progress

    • I noticed some Progress are being used, but as far as I know, we don’t have any way to track the progress of imputations. To simplify the code, I recommend removing these.
  4. Improvements to _get_non_numerical_column_indices (prepocessing/_imputation.py)

    • I rewrote _is_float_or_nan to eliminate redundancy
    • I added _is_float_or_nan_row to avoid the need for np.vectorize. This change also resolves the type mismatch warnings caused by the use of np.apply_along_axis with the result of np.vectorize.
  5. Documentation

    • I’m generally skeptical about documenting the exceptions a function can raise unless enforced by a compiler (e.g., in Java).

      • If we want to document all possible exceptions, we’d have to include everything that could be raised by nested calls, which is practically impossible to maintain—especially when external libraries are involved.
      • If we don’t want to do this, what’s the point of an incomplete list?
    • In my opinion, documenting exceptions is only relevant in two cases:

      1. The function itself handles all exceptions from its calls and raises a limited, well-defined set of exceptions.
      2. The function raises custom exceptions that users might need to catch specifically.

@nicolassidoux
Copy link
Collaborator Author

nicolassidoux commented Nov 20, 2024

I’ve updated the branch following my tests.

In test_imputation.py, I added a _base_check_imputation function designed to ensure that every imputation test meets the following criteria:

  • No missing data remains after the imputation.
  • No non-missing data is altered during the imputation process.
  • Only the targeted subset is modified during partial imputations, even if the data in other subsets is missing.

Additionally, I wrote dedicated tests to validate the _base_check_imputation function itself. Test-ception, if you will.

Let me know your thoughts before I proceed with creating a PR.

@eroell
Copy link
Collaborator

eroell commented Nov 23, 2024

  • I am not sure what's the Python convention of importing a private module from a different module within the same package. Do you happen to have a link on that?
  • renaming of protected module names for consistency sounds good to me!
  • The behavior of functions not altering the shape of the anndata (e.g. encode does change it, but normalizations do not) should be that if copy=False, it alters the data inplace (that is, for example in X) and just returns None, and only if copy=True returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?
  • No objections
  • Sounds reasonable, will have a close look during review
  • I do fully agree. Removing it from simple_impute and missforest_impute (the only two appearances in the imputation methods I think?) would be nice.

@eroell
Copy link
Collaborator

eroell commented Nov 23, 2024

Regarding the behaviour for non-encoded data:

So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.

Agreed!

@eroell
Copy link
Collaborator

eroell commented Nov 23, 2024

Looking forward very much to your PR! :)

@nicolassidoux
Copy link
Collaborator Author

I am not sure what's the Python convention of importing a private module from a different module within the same package. Do you happen to have a link on that?

For what I understood, a package A has access to a private module in another package B if A is lower in the hierarchy (ie B is somehow a parent of A). In our case A tries to get access to a module of C, but C is not a parent of A, hence the warning.

The behavior of functions not altering the shape of the anndata (e.g. encode does change it, but normalizations do not) should be that if copy=False, it alters the data inplace (that is, for example in X) and just returns None, and only if copy=True returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?

Since the hints suggest the function returns an AnnData, so a user could write this:
adata = knn_impute(adata, copy=True)
In this case adata is now None if knn_impute doesn't return anything. Note that we do this kind of assignments in our tests, to check if ids are equal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants