Data handling #34

Open
JGarciaCondado opened this issue Apr 11, 2024 · 3 comments

@JGarciaCondado
Contributor

The software package currently handles tabular data only. However, one important aspect has not been dealt with: categorical variables.

To improve this:

  • We need to add detection of categorical variables in the features, covariates and factors files.
  • Apply correct handling of these variables. A commonly used strategy is conversion to one-hot encoding.
  • For age modelling, we should ensure that the encoded variables are treated appropriately in the scaler.
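
As a starting point for discussion, the first two points could be sketched with pandas as below. This is only a sketch: `encode_categoricals` is a hypothetical helper, not something that exists in the package, and it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd


def encode_categoricals(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    """Detect non-numeric columns and one-hot encode them (hypothetical helper)."""
    # Assumption: any column that is not numeric (object, string, category)
    # is categorical. Numeric and boolean columns are left untouched.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if not cat_cols:
        return df.copy(), []
    # One indicator column per category level, stored as floats.
    encoded = pd.get_dummies(df, columns=cat_cols, dtype=float)
    return encoded, cat_cols


# Made-up covariates table, for illustration only.
covariates = pd.DataFrame(
    {"age": [61.0, 72.5, 55.0], "sex": ["F", "M", "F"]},
    index=["sub001", "sub002", "sub003"],
)
encoded, cat_cols = encode_categoricals(covariates)
```

Returning the list of detected columns would let the scaler step know which original columns were categorical, so that the resulting indicator columns can be handled separately from the continuous ones if that turns out to be the right treatment.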

Another aspect of data handling is data imputation. Currently, any subject with missing data in any of the files submitted is discarded. However, some basic imputation strategies could be implemented.
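
One basic strategy would be median imputation of numeric columns. The sketch below assumes the files are loaded as pandas DataFrames; `impute_median` is a hypothetical name, and the choice of the median is just one option among several.

```python
import numpy as np
import pandas as pd


def impute_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with the column median (hypothetical helper)."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # fillna with a Series fills each column with its own median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Made-up features table, for illustration only.
features = pd.DataFrame(
    {"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]},
    index=["sub001", "sub002", "sub003"],
)
imputed = impute_median(features)
```

With this approach no subject is discarded, at the cost of introducing values that were never measured.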

@JGarciaCondado JGarciaCondado added the enhancement New feature or request label Apr 11, 2024
@JGarciaCondado JGarciaCondado added this to the Release 1.0 milestone Apr 15, 2024
@JGarciaCondado
Contributor Author

When multiple systems are named, we should also handle missing data per system: if a subject has missing data for one system but not for another, the subject should be removed only when calculating the age model of that specific system.
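
A minimal sketch of the per-system filtering, assuming systems are represented as a mapping from system name to feature column names (the `systems` dict, the column names and `subjects_per_system` are all made up for illustration):

```python
import numpy as np
import pandas as pd


def subjects_per_system(
    features: pd.DataFrame, systems: dict[str, list[str]]
) -> dict[str, pd.DataFrame]:
    """Drop subjects with missing data separately for each system."""
    # A subject is removed from a system only if one of that system's
    # own features is missing; other systems are unaffected.
    return {name: features[cols].dropna() for name, cols in systems.items()}


# Made-up data: sub002 lacks the brain feature, sub003 lacks the cardiac one.
features = pd.DataFrame(
    {"brain_vol": [1.2, np.nan, 1.1], "heart_rate": [60.0, 72.0, np.nan]},
    index=["sub001", "sub002", "sub003"],
)
systems = {"brain": ["brain_vol"], "cardiac": ["heart_rate"]}
per_system = subjects_per_system(features, systems)
```

Here `sub002` would still contribute to the cardiac model and `sub003` to the brain model, whereas dropping rows on the whole table would leave only `sub001`.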

@JGarciaCondado
Contributor Author

We have also found a new bug. Uploading a .csv whose index is not numeric throws an error. We should test and fix this so that files whose first column is named subject, with values sub001, sub002, sub003, ..., work. Otherwise we should specify that files must have a column called ID (this will cause fewer problems, and when loading a .csv the ID column should be made the index). Either way, we should still ensure that the indices can be arbitrary numbers or alphanumeric values.
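
A possible shape for the loader under the ID-column option, as a sketch only. `load_csv` and its `id_col` parameter are hypothetical, and converting IDs to strings is an assumption made here so that numeric and alphanumeric IDs are handled the same way.

```python
import io

import pandas as pd


def load_csv(source, id_col: str = "ID") -> pd.DataFrame:
    """Read a .csv and use the ID column as a string index (hypothetical helper)."""
    df = pd.read_csv(source)
    if id_col not in df.columns:
        raise ValueError(f"File must contain a column called '{id_col}'.")
    # Cast IDs to strings so numeric and alphanumeric IDs behave alike.
    # Note: purely numeric IDs are parsed as numbers first, so leading
    # zeros would not survive this step.
    df[id_col] = df[id_col].astype(str)
    return df.set_index(id_col)


# Made-up file contents mixing alphanumeric and numeric IDs.
csv_text = "ID,f1\nsub001,1.0\n42,2.0\na1b2,3.0\n"
loaded = load_csv(io.StringIO(csv_text))
```

A test along these lines, with one file per ID style, would cover the cases mentioned above.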

@JGarciaCondado JGarciaCondado added the bug Something isn't working label Sep 30, 2024
@itellaetxe itellaetxe self-assigned this Oct 11, 2024
@JGarciaCondado
Contributor Author

When looking at clinical factors we should not remove every subject that has a NaN in any factor. In many studies some subjects have taken some tests and other subjects have taken different ones, so removing them all drastically reduces the number of subjects. I would go for an approach where we report the number of subjects used for each factor but keep as many as possible. Imputation here would not be a good strategy.
