Data handling #34

JGarciaCondado · 2024-04-11T09:20:42Z

The software package currently handles tabular data only. However, one important aspect has not been dealt with yet: categorical variables.

To improve this:

We need to add detection of categorical variables in the features, covariates and factors files.
We need to handle these variables correctly. A commonly used strategy is conversion to one-hot encoding.
For age modelling, we should ensure that they are treated appropriately in the scaler.
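
A minimal sketch of the detection and encoding steps, assuming the files are loaded as pandas DataFrames. The function name `one_hot_encode`, the example columns, and the rule "any non-numeric column is categorical" are illustrative assumptions, not existing package code:

```python
import pandas as pd


def one_hot_encode(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    """Return the encoded frame and the names of the new indicator columns."""
    # Assumed detection rule: any non-numeric column (object, string, category)
    # is treated as categorical.
    categorical = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if not categorical:
        return df.copy(), []
    encoded = pd.get_dummies(df, columns=categorical, dtype=float)
    # Indicator columns are the ones that did not exist before encoding.
    indicators = [c for c in encoded.columns if c not in df.columns]
    return encoded, indicators


# Hypothetical features table with one continuous and one categorical column.
features = pd.DataFrame(
    {"volume": [1.2, 3.4, 2.2], "site": ["A", "B", "A"]},
    index=["sub001", "sub002", "sub003"],
)
encoded, indicators = one_hot_encode(features)
# Only the continuous columns would be handed to the scaler.
continuous = [c for c in encoded.columns if c not in indicators]
```

Returning the indicator column names separately is one way to let the scaler skip them, so the 0/1 columns are not standardised along with the continuous features.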

Another aspect of data handling is data imputation. Currently, any subject with missing data in any of the files submitted is discarded. However, some basic imputation strategies could be implemented.
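
As a starting point, a basic strategy could look like the sketch below. It assumes pandas DataFrames and fills only numeric columns with the column median or mean; the function name `impute_numeric` and its `strategy` parameter are hypothetical:

```python
import numpy as np
import pandas as pd


def impute_numeric(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Fill missing values in numeric columns; other columns are left as they are."""
    numeric = df.select_dtypes(include="number").columns
    if strategy == "median":
        fill = df[numeric].median()
    elif strategy == "mean":
        fill = df[numeric].mean()
    else:
        raise ValueError(f"Unknown imputation strategy: {strategy!r}")
    out = df.copy()
    # fillna with a Series fills each column with the value under its own label.
    out[numeric] = out[numeric].fillna(fill)
    return out


# Hypothetical example: one missing numeric value and one missing label.
example = pd.DataFrame({"thickness": [1.0, np.nan, 3.0], "site": ["A", None, "B"]})
imputed = impute_numeric(example, strategy="mean")
```

Non-numeric columns are deliberately left alone here, since filling categorical values needs a separate decision (for example, most frequent value or an explicit "missing" category).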

JGarciaCondado · 2024-07-18T10:49:57Z

When multiple systems are defined and a subject has missing data for one system but not for another, we should remove that subject only when fitting the age model of the affected system.
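
One way to sketch this, assuming the systems are available as a mapping from system name to feature columns. The function name `subjects_per_system`, the mapping, and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd


def subjects_per_system(
    features: pd.DataFrame, systems: dict[str, list[str]]
) -> dict[str, pd.DataFrame]:
    """Drop subjects per system, looking only at that system's own columns."""
    return {name: features[cols].dropna() for name, cols in systems.items()}


# Hypothetical features: each subject misses data in at most one system.
features = pd.DataFrame(
    {"heart_rate": [60.0, np.nan, 72.0], "brain_volume": [1.1, 1.3, np.nan]},
    index=["sub001", "sub002", "sub003"],
)
systems = {"cardiac": ["heart_rate"], "brain": ["brain_volume"]}
per_system = subjects_per_system(features, systems)
```

Each system then keeps its own subject list, and the ages can be matched to each table through the shared index before fitting that system's model.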

JGarciaCondado · 2024-09-30T10:22:13Z

We have also found a new bug. Uploading a .csv whose index is not numeric throws an error. We should test and fix this so that files whose first column is named subject, with values sub001, sub002, sub003, ..., work. Otherwise, we should specify that files must have a column called ID, which is made the index when the .csv is loaded (this would avoid more problems). In either case, we should still ensure that the indices can be arbitrary numbers or alphanumeric values.
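
A possible loader for the ID-column option, as a sketch only. It assumes pandas, and the function name `load_indexed_csv` and the duplicate-ID check are suggestions rather than current behaviour:

```python
import pandas as pd


def load_indexed_csv(path: str, id_col: str = "ID") -> pd.DataFrame:
    """Load a .csv and use `id_col` as a string index."""
    # Read only the header first so a missing ID column gives a clear error.
    header = pd.read_csv(path, nrows=0)
    if id_col not in header.columns:
        raise ValueError(f"File {path!r} must contain a column named {id_col!r}")
    # Reading the ID column as str keeps values such as "007" or "sub001" intact.
    df = pd.read_csv(path, dtype={id_col: str})
    if df[id_col].duplicated().any():
        raise ValueError(f"Column {id_col!r} in {path!r} contains duplicate values")
    return df.set_index(id_col)
```

Forcing the ID column to `str` means numeric, zero-padded, and alphanumeric identifiers are all handled the same way, and files can be matched on the index without type mismatches.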

JGarciaCondado · 2024-11-13T16:19:41Z

When looking at clinical factors, we should not remove every subject that has a NaN in any factor. In many studies, some subjects have some tests and other subjects have others, so doing this drastically reduces the number of subjects. I would go for an approach where we report the number of subjects used for each factor but keep as many as possible. Imputation would not be a good strategy here.

JGarciaCondado added the "enhancement" (New feature or request) label Apr 11, 2024

JGarciaCondado added this to the Release 1.0 milestone Apr 15, 2024

JGarciaCondado added the "bug" (Something isn't working) label Sep 30, 2024

itellaetxe self-assigned this Oct 11, 2024
