Python datetime.date data type is handled as str and datatype handling in general #64

Open
efstathios-chatzikyriakidis opened this issue Feb 23, 2024 · 0 comments


Hi @avsolatorio,

In one of my recent tests, I used a dataframe with a column of Python datetime.date values. As expected, pandas reports this column's dtype as object, its usual fallback for values outside its native types such as int, float, and datetime.
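A minimal reproduction of the dtype behaviour, using only pandas and the standard library. The column names and values are made up for illustration:

```python
import datetime

import pandas as pd

df = pd.DataFrame(
    {
        "birth_date": [datetime.date(1990, 1, 15), datetime.date(2001, 7, 4)],
        "amount": [10.5, 20.0],
    }
)

# pandas has no dedicated dtype for datetime.date, so the column falls back to object.
date_dtype = df["birth_date"].dtype
print(date_dtype)  # object

# The individual values are still datetime.date instances, not strings.
first_value = df["birth_date"].iloc[0]
print(type(first_value))
```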

I believe object columns like this should not be converted to strings automatically, because it causes problems. In my case, the model treats the date column as a string column, so the synthetic output contains strings instead of dates, and no new date values are generated. Converting Python dates to timestamps could solve this: they would be handled as numerical data, which allows new values to be generated for the synthetic data.
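As a rough illustration of the numeric round trip, here is a sketch that encodes dates as integer day counts with the standard library's date.toordinal and date.fromordinal. This is only one possible encoding and an assumption on my side, not what the library does today:

```python
import datetime

import pandas as pd

dates = pd.Series(
    [datetime.date(2020, 1, 1), datetime.date(2021, 6, 30)], name="event_date"
)

# Encode: each date becomes its proleptic Gregorian ordinal (an integer day count).
encoded = dates.map(datetime.date.toordinal)

# ... a model would train on, and sample from, the numeric column here ...

# Decode: turn integer day counts back into datetime.date objects.
decoded = encoded.map(datetime.date.fromordinal)
```

Because the encoded column is numeric, a sampled value that never appeared in the training data still decodes to a valid date.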

Past issues (#33, #31, #36) have shown various problems related to datatype handling. To address them comprehensively, I propose adding a dictionary parameter to the fit method or the REaLTabFormer constructor that specifies which columns are categorical, regardless of their datatype. If a column is marked as categorical, the library would treat it as a string internally (whatever its true datatype), generate values, and return them to the user in the original datatype rather than as strings. A column that is not marked as categorical would be treated as categorical only if it is a string column. Any other column (e.g. int, float, timestamp, date) would be handled as non-string: the library generates new values and preserves the original datatype in the output.

The following requirements aim to address many of the issues we've encountered:

  • Preservation of the original Python datatype in synthetic data, rather than relying solely on Pandas' types.
  • Ability to designate columns as categorical explicitly through the dictionary parameter.
  • No attempt to parse string columns into other datatypes such as int; they are treated as strings. A string is a string.
  • Trusting the datatype information inherent in the data itself over Pandas' interpretation.
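To make the proposal more concrete, here is a rough sketch of the encode/decode bookkeeping it implies. The names encode, decode, and the categorical dictionary are hypothetical and not part of REaLTabFormer. The sketch assumes non-empty columns without missing values, and that categorical values are restored through a lookup of the values seen in the input:

```python
import datetime

import pandas as pd


def encode(df, categorical):
    """Hypothetical helper: `categorical` maps column name -> bool."""
    encoded, schema = df.copy(), {}
    for col in df.columns:
        # Trust the type of the values themselves, not the pandas dtype.
        py_type = type(df[col].iloc[0])
        # Unmarked columns count as categorical only if they hold strings.
        is_cat = categorical.get(col, py_type is str)
        # Remember how to map string categories back to the original values.
        lookup = {str(v): v for v in df[col].unique()} if is_cat else None
        schema[col] = (py_type, is_cat, lookup)
        if is_cat:
            encoded[col] = df[col].map(str)
        elif py_type is datetime.date:
            encoded[col] = df[col].map(datetime.date.toordinal)
    return encoded, schema


def decode(encoded, schema):
    """Hypothetical helper: restore the original datatypes recorded by encode."""
    decoded = encoded.copy()
    for col, (py_type, is_cat, lookup) in schema.items():
        if is_cat:
            decoded[col] = encoded[col].map(lookup)
        elif py_type is datetime.date:
            decoded[col] = encoded[col].map(datetime.date.fromordinal)
    return decoded


df = pd.DataFrame(
    {
        "zip_code": [10115, 80331, 10115],  # int, but really a category
        "visit": [
            datetime.date(2024, 2, 1),
            datetime.date(2024, 2, 23),
            datetime.date(2024, 3, 5),
        ],
        "note": ["42", "7", "42"],  # digits only, but a string is a string
    }
)

encoded, schema = encode(df, {"zip_code": True})
restored = decode(encoded, schema)
```

In this sketch zip_code is handled as strings only because it was marked explicitly, visit goes through a numeric encoding, and note is never parsed into int. All three come back in their original datatype.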

What do you think of this new functionality? I don't know how difficult it would be, as it might affect many things.
