Make all imputation methods consistent in regard to encoding requirements #824

nicolassidoux · 2024-11-13T09:35:33Z

Description of feature

Imputation requires proper encoding of the data, but at the moment, there is no consistent strategy about what to do if a passed AnnData is not encoded. Some methods throw an exception, while some others silently encode the data with an arbitrary method.

It would be probably better to just throw an exception and let the caller decide what to do instead of altering the data.

Function	Encoding required	Action if not encoded
`explicit_impute`	No	None
`simple_impute`	Strategy-dependant	Raise ValueError
`knn_impute`	Yes	Perform ordinal encoding
`miss_forest_impute`	Yes	Perform ordinal encoding
`mice_forest_impute`	Yes	Perform ordinal encoding

So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.

The text was updated successfully, but these errors were encountered:

nicolassidoux · 2024-11-15T16:11:03Z

Comments so far (note: I haven’t tested anything yet)

Imports from protected
- I noticed that prepocessing/_imputation.py contains imports from protected modules or functions:
```
from ehrapy.anndata.anndata_ext import _get_column_indices  
```
  - _get_column_indices is primarily used outside its module and could be valuable to the public API. Since the documentation is already in place, I suggest making it public.
```
from ehrapy.core._tool_available import _check_module_importable  
```
  - _check_module_importable is only used (outside its module) in prepocessing/_imputation.py to check for the presence of sklearnex. If conditional execution based on this is truly necessary, I propose making _check_module_importable public across the entire ehrapy module.
  - For consistency, I also recommend making _shell_command_accessible module-public. To align naming conventions, I renamed _tool_available to _utils_available to match _utils_doc (previously _doc_util).
Return Value Behavior
- Some functions return an AnnData object only if the copy argument is set. I believe this is a mistake. The hints specifices the functions should consistently return the AnnData object passed to them, regardless of whether it’s copied at the start.
Use of Progress
- I noticed some Progress are being used, but as far as I know, we don’t have any way to track the progress of imputations. To simplify the code, I recommend removing these.
Improvements to _get_non_numerical_column_indices (prepocessing/_imputation.py)
- I rewrote _is_float_or_nan to eliminate redundancy
- I added _is_float_or_nan_row to avoid the need for np.vectorize. This change also resolves the type mismatch warnings caused by the use of np.apply_along_axis with the result of np.vectorize.
Documentation
- I’m generally skeptical about documenting the exceptions a function can raise unless enforced by a compiler (e.g., in Java).
  - If we want to document all possible exceptions, we’d have to include everything that could be raised by nested calls, which is practically impossible to maintain—especially when external libraries are involved.
  - If we don’t want to do this, what’s the point of an incomplete list?
- In my opinion, documenting exceptions is only relevant in two cases:
  1. The function itself handles all exceptions from its calls and raises a limited, well-defined set of exceptions.
  2. The function raises custom exceptions that users might need to catch specifically.

nicolassidoux · 2024-11-20T18:58:56Z

I’ve updated the branch following my tests.

In test_imputation.py, I added a _base_check_imputation function designed to ensure that every imputation test meets the following criteria:

No missing data remains after the imputation.
No non-missing data is altered during the imputation process.
Only the targeted subset is modified during partial imputations, even if the data in other subsets is missing.

Additionally, I wrote dedicated tests to validate the _base_check_imputation function itself. Test-ception, if you will.

Let me know your thoughts before I proceed with creating a PR.

eroell · 2024-11-23T15:10:30Z

I am not sure what's the Python convention of importing a private module from a different module within the same package. Do you happen to have a link on that?
renaming of protected module names for consistency sounds good to me!

The behavior of functions not altering the shape of the anndata (e.g. encode does change it, but normalizations do not) should be that if copy=False, it alters the data inplace (that is, for example in X) and just returns None, and only if copy=True returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?

No objections

Sounds reasonable, will have a close look during review

I do fully agree. Removing it from simple_impute and missforest_impute (the only two appearances in the imputation methods I think?) would be nice.

eroell · 2024-11-23T15:11:28Z

Regarding the behaviour for non-encoded data:

So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.

Agreed!

eroell · 2024-11-23T15:12:34Z

Looking forward very much to your PR! :)

nicolassidoux · 2024-11-24T09:26:47Z

I am not sure what's the Python convention of importing a private module from a different module within the same package. Do you happen to have a link on that?

For what I understood, a package A has access to a private module in another package B if A is lower in the hierarchy (ie B is somehow a parent of A). In our case A tries to get access to a module of C, but C is not a parent of A, hence the warning.

The behavior of functions not altering the shape of the anndata (e.g. encode does change it, but normalizations do not) should be that if copy=False, it alters the data inplace (that is, for example in X) and just returns None, and only if copy=True returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?

Since the hints suggest the function returns an AnnData, so a user could write this:
adata = knn_impute(adata, copy=True)
In this case adata is now None if knn_impute doesn't return anything. Note that we do this kind of assignments in our tests, to check if ids are equal.

nicolassidoux added the enhancement New feature or request label Nov 13, 2024

nicolassidoux self-assigned this Nov 13, 2024

nicolassidoux mentioned this issue Nov 21, 2024

Encoding consistency with imputations #807

Closed

nicolassidoux linked a pull request Nov 24, 2024 that will close this issue

Make all imputation methods consistent in regard to encoding requirements #827

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make all imputation methods consistent in regard to encoding requirements #824

Make all imputation methods consistent in regard to encoding requirements #824

nicolassidoux commented Nov 13, 2024

nicolassidoux commented Nov 15, 2024

nicolassidoux commented Nov 20, 2024 •

edited

Loading

eroell commented Nov 23, 2024 •

edited

Loading

eroell commented Nov 23, 2024 •

edited

Loading

eroell commented Nov 23, 2024

nicolassidoux commented Nov 24, 2024

Make all imputation methods consistent in regard to encoding requirements #824

Make all imputation methods consistent in regard to encoding requirements #824

Comments

nicolassidoux commented Nov 13, 2024

Description of feature

nicolassidoux commented Nov 15, 2024

Comments so far (note: I haven’t tested anything yet)

nicolassidoux commented Nov 20, 2024 • edited Loading

eroell commented Nov 23, 2024 • edited Loading

eroell commented Nov 23, 2024 • edited Loading

eroell commented Nov 23, 2024

nicolassidoux commented Nov 24, 2024

nicolassidoux commented Nov 20, 2024 •

edited

Loading

eroell commented Nov 23, 2024 •

edited

Loading

eroell commented Nov 23, 2024 •

edited

Loading