No guarantee of > 1 present values in all training set rows #3

Open
lincoln-harris opened this issue Apr 11, 2022 · 0 comments
lincoln-harris commented Apr 11, 2022

Rows in the training set(s) could be composed entirely of missing values.

utilities.split implements a check to ensure that every row selected for the training set has more than some threshold number of present values. However, enabling this check often excludes rows, which reduces the number of rows in the training, validation and test sets.
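To illustrate how a present-values check shrinks the matrix, here is a minimal sketch. The helper `filter_rows_by_present` and its `min_present` parameter are hypothetical names chosen for this example; this is not the actual utilities.split code.

```python
import numpy as np

def filter_rows_by_present(matrix, min_present=1):
    """Keep rows with at least `min_present` non-NaN entries.

    Hypothetical helper for illustration only; the real check in
    utilities.split may be implemented differently.
    """
    present_counts = np.sum(~np.isnan(matrix), axis=1)
    keep = present_counts >= min_present
    return matrix[keep], keep

mat = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, np.nan],  # row with no present values
    [np.nan, 2.0, np.nan],
])
filtered, keep = filter_rows_by_present(mat, min_present=1)
print(mat.shape, filtered.shape)  # (3, 3) (2, 3)
```

The filtered matrix has fewer rows than the input, which is the size mismatch described above.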

For the last step of the ms_imputer workflow, we want to impute missing values in the original (i.e. non-partitioned) matrix. To do this, we need to have trained on a matrix of equivalent size.

Right now, the present values check in utilities.split is turned off. This ensures that the training, validation and test matrices are the same size as the initial matrix, so the workflow completes successfully. However, there could be weirdness due to the model attempting to learn from rows that are entirely np.nan.
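A quick way to see how many training rows are affected is to count the rows with no present values. The sketch below assumes a mask-based split in which held-out entries are set to np.nan so the training matrix keeps the original shape; `count_empty_rows` is a hypothetical helper, not part of ms_imputer.

```python
import numpy as np

def count_empty_rows(train):
    """Return the number of rows in `train` that are entirely NaN."""
    return int(np.sum(np.all(np.isnan(train), axis=1)))

# Toy original matrix: row 0 has a single present value.
full = np.arange(12, dtype=float).reshape(3, 4)
full[0, :3] = np.nan

# Assumed mask-based split: the one present value in row 0 is held out,
# so the training matrix keeps its shape but row 0 becomes all NaN.
train = full.copy()
train[0, 3] = np.nan

print(train.shape == full.shape, count_empty_rows(train))  # True 1
```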

There's a valid question of how much this actually matters. Probably the smart thing to do is to evaluate NMF models trained with slightly different present value thresholds and see how much the reconstruction error changes.
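One possible shape for that experiment is sketched below. Everything here is an assumption made for illustration: `masked_nmf` is a small stand-in NMF (multiplicative updates that ignore NaN entries), not the ms_imputer model, and the data, thresholds, and function names are made up. To keep the comparison fair, error is measured only on held-out entries in rows that pass the strictest threshold.

```python
import numpy as np

def masked_nmf(X, n_factors=2, n_iter=200, seed=0):
    """Stand-in NMF: multiplicative updates that skip NaN entries."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)
    X0 = np.where(mask, X, 0.0)
    W = rng.random((X.shape[0], n_factors)) + 0.1
    H = rng.random((n_factors, X.shape[1])) + 0.1
    eps = 1e-9
    for _ in range(n_iter):
        W *= (X0 @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ X0) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

def sweep_thresholds(X_true, train, held_out, thresholds):
    """Held-out MSE per threshold, scored on a common set of rows."""
    counts = np.sum(~np.isnan(train), axis=1)
    eval_rows = counts >= max(thresholds)  # rows every model was trained on
    results = {}
    for t in thresholds:
        keep = counts >= t  # rows that pass this present-values threshold
        W, H = masked_nmf(train[keep])
        recon = W @ H
        # Score only held-out entries in the common evaluation rows.
        score_mask = held_out[keep] & eval_rows[keep][:, None]
        diff = recon[score_mask] - X_true[keep][score_mask]
        results[t] = float(np.mean(diff ** 2))
    return results

# Synthetic rank-2 non-negative data with ~30% of entries held out.
rng = np.random.default_rng(1)
X_true = rng.random((30, 2)) @ rng.random((2, 8))
held_out = rng.random(X_true.shape) < 0.3
train = np.where(held_out, np.nan, X_true)

results = sweep_thresholds(X_true, train, held_out, thresholds=[0, 1, 3])
for t, mse in results.items():
    print(f"min present values = {t}: held-out MSE = {mse:.4f}")
```

If the errors barely move across thresholds, the empty rows probably do little harm; a large gap would suggest the check is worth re-enabling in some size-preserving form.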

lincoln-harris added the enhancement (New feature or request) label Apr 11, 2022
lincoln-harris self-assigned this Apr 11, 2022