Conditional sampling and cross-entropy loss #235
Upon further inspection, I believe there may be a problem in:

    if condition_column is not None and condition_value is not None:
        condition_info = self._transformer.convert_column_name_value_to_id(
            condition_column, condition_value)
        global_condition_vec = self._data_sampler.generate_cond_from_condition_column_info(
            condition_info, self._batch_size)

CTGAN/ctgan/synthesizers/ctgan.py
Line 443 in 5358af7
The |
Update: I believe a problem may reside in the following method:

    def generate_cond_from_condition_column_info(self, condition_info, batch):
        """Generate the condition vector."""
        vec = np.zeros((batch, self._n_categories), dtype='float32')
        id_ = self._discrete_column_matrix_st[condition_info['discrete_column_id']]
        id_ += condition_info['value_id']
        vec[:, id_] = 1
        return vec

Line 153 in 5358af7
Specifically, self._discrete_column_matrix_st is initialized as

    self._discrete_column_matrix_st = np.zeros(n_discrete_columns, dtype='int32')

and does not seem to be changed afterward. Therefore, I believe that

    id_ = self._discrete_column_matrix_st[condition_info['discrete_column_id']]

will always return zero. However, this does not seem to solve the initial issue. I will look further into the conditional generation part, which was my main issue. I noticed that Line 123 in 5358af7
And again, I believe that matrix_st will always be zero. I am not sure whether this may cause any unwanted behavior.
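To make the concern above concrete, here is a small standalone sketch in plain Python. It is not CTGAN's code: `start_offsets`, `build_cond_vector`, and the column sizes are hypothetical names and values chosen for illustration. It shows how per-column start offsets decide where the 1 lands in a concatenated one-hot condition vector, and what would happen if those offsets were all zero, as suspected in this comment.

```python
from itertools import accumulate


def start_offsets(n_categories_per_column):
    """Return the start index of each column's block in the concatenated one-hot vector."""
    # Hypothetical helper: cumulative sum of the block sizes, shifted by one position.
    return [0] + list(accumulate(n_categories_per_column))[:-1]


def build_cond_vector(starts, total, column_id, value_id):
    """Build a single condition vector with a 1 at starts[column_id] + value_id."""
    vec = [0.0] * total
    vec[starts[column_id] + value_id] = 1.0
    return vec


# Assumed example: three discrete columns with 3, 2, and 4 categories.
sizes = [3, 2, 4]
total = sum(sizes)

good_starts = start_offsets(sizes)   # cumulative offsets, one per column
zero_starts = [0] * len(sizes)       # the suspected "never updated" case

# Condition on column 1, category 1.
good = build_cond_vector(good_starts, total, column_id=1, value_id=1)
bad = build_cond_vector(zero_starts, total, column_id=1, value_id=1)

print(good.index(1.0), bad.index(1.0))
```

With cumulative offsets the 1 falls inside the block belonging to column 1, while with all-zero offsets it falls inside the block of column 0, so conditions on different columns could collide.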
|
Hi @AndresAlgaba, thanks for filing this and looking into it. I just wanted to confirm that we've seen this. We can update this issue when we have more bandwidth to debug. If you do end up finding the root cause, please let us know! BTW, what is your overall use case for conditional sampling / synthetic data? Even if this conditional vector manipulation may not be working as intended, you can still use a rejection sampling-based approach (synthesizing data without any conditions and then throwing away the rows you don't need). The SDV library provides convenience wrappers around CTGAN to help you do exactly this. This User Guide may be helpful, particularly the conditional sampling section. |
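The rejection sampling approach mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the SDV wrapper: `toy_sampler` is a hypothetical stand-in for an unconditional synthesizer, and `reject_sample`, its parameters, and the column values are assumptions made for the example.

```python
import random


def toy_sampler(n, rng):
    """Hypothetical stand-in for an unconditional sample(n) call."""
    return [
        {'sex': rng.choice([' Male', ' Female']), 'age': rng.randint(18, 90)}
        for _ in range(n)
    ]


def reject_sample(sampler, n_rows, column, value, rng, batch_size=100, max_batches=1000):
    """Draw unconditional batches and keep only rows where row[column] == value."""
    kept = []
    for _ in range(max_batches):
        batch = sampler(batch_size, rng)
        kept.extend(row for row in batch if row[column] == value)
        if len(kept) >= n_rows:
            return kept[:n_rows]
    # Give up instead of looping forever when the condition is very rare.
    raise RuntimeError('Could not collect enough matching rows.')


rng = random.Random(0)
rows = reject_sample(toy_sampler, 50, 'sex', ' Male', rng)
print(len(rows))
```

The cost of this approach grows as the requested value becomes rarer, which is why a cap on the number of batches is included in the sketch.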
Hi @npatki, no problem, and thanks for the confirmation! Besides the change from Line 153 in 5358af7
(By the way, I found issue #169 talking about a similar issue with _discrete_column_matrix_st ).
I have found that proper sampling requires the generator to be put in evaluation mode with self._generator.eval(), as batch normalization is used in the generator. I have opened a PR with the proposed changes: #236. Thank you for the suggestion on the SDV library! An issue (sdv-dev/SDV#623) brought me to examine the conditional sampling :). |
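One way to see why the mode can matter with batch normalization is the toy example below. It is a deliberately simplified, pure-Python batch norm over a single feature (`ToyBatchNorm` is a hypothetical class, not `torch.nn.BatchNorm1d`), and whether this is exactly what happens inside CTGAN's generator is an assumption. In training mode the layer normalizes with the statistics of the current batch, so a signal that is identical for every row in the batch is subtracted away; in evaluation mode it uses the running statistics instead.

```python
class ToyBatchNorm:
    """Simplified one-feature batch norm, for illustration only."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def eval(self):
        """Switch to evaluation mode (use running statistics)."""
        self.training = False

    def __call__(self, batch):
        if self.training:
            # Training mode: normalize with the statistics of this batch
            # and update the running averages.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Evaluation mode: normalize with the running statistics.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


bn = ToyBatchNorm()
for _ in range(200):
    bn([-1.0, 0.0, 1.0, 0.0])  # "training" batches centered at zero

constant_batch = [3.0, 3.0, 3.0, 3.0]  # same value in every row
train_out = bn(constant_batch)         # batch mean removes the shared value
bn.eval()
eval_out = bn(constant_batch)          # running statistics keep the shared value

print(train_out, eval_out)
```

In the training-mode call, every output is zero because the shared value equals the batch mean, whereas in evaluation mode the shared value survives normalization.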
Hi everyone! I have a question/problem regarding the conditional sampling in the sample method of the CTGANSynthesizer using the condition_column and condition_value arguments. For example, derived from the Usage Example in the README:
Note that the whitespace in condition_value=' Male' is intentional, see #233 and #234.
Environment Details
Problem description
Intuitively, it seems that when a model is sufficiently trained, the conditional sampling should (almost) only generate examples satisfying the criteria given by the conditional vector. To monitor whether this is happening during training, I've printed the cross-entropy loss as follows:
CTGAN/ctgan/synthesizers/ctgan.py
Line 419 in 5358af7
The cross-entropy loss rapidly approaches zero, indicating that the generated examples satisfy the conditional vector criteria during training.
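As a reminder of what a near-zero value means here, the following is a minimal pure-Python sketch of softmax cross-entropy for a single row against a one-hot target. It is not the loss code referenced above; the function name and the example logits are assumptions chosen for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy of one row of logits against a one-hot target."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


# Assumed two-category column; the condition asks for category 0.
matching = cross_entropy([10.0, -10.0], 0)   # generator strongly prefers category 0
uniform = cross_entropy([0.0, 0.0], 0)       # generator is undecided
mismatch = cross_entropy([-10.0, 10.0], 0)   # generator strongly prefers category 1

print(matching, uniform, mismatch)
```

The loss is close to zero only when the logit of the conditioned category dominates, which is the sense in which a vanishing loss suggests that the generated rows match the conditional vector during training.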
However, when sampling with the sample method, the generated samples do not satisfy the criteria substantially more than when no criteria are given (and thus, the empirical distribution is used). I could not find any issues in the code, and was wondering whether my intuition was wrong?
What I already tried
I have also done a similar analysis using the test example:
CTGAN/tests/integration/synthesizer/test_ctgan.py
Line 123 in 5358af7
but reached similar results.
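The comparison described in this issue (how often conditioned samples satisfy the condition versus unconditioned samples) can be sketched as a simple hit-rate check. The helper `hit_rate` is hypothetical, and the rows below are made-up placeholders that only demonstrate the computation; they are not results from CTGAN.

```python
def hit_rate(rows, column, value):
    """Fraction of rows whose `column` equals `value`."""
    if not rows:
        raise ValueError('rows must not be empty')
    return sum(1 for row in rows if row[column] == value) / len(rows)


# Placeholder rows standing in for the output of two sample(...) calls.
conditioned = [{'sex': ' Male'}, {'sex': ' Male'}, {'sex': ' Female'}, {'sex': ' Male'}]
unconditioned = [{'sex': ' Male'}, {'sex': ' Female'}, {'sex': ' Female'}, {'sex': ' Male'}]

conditioned_rate = hit_rate(conditioned, 'sex', ' Male')
unconditioned_rate = hit_rate(unconditioned, 'sex', ' Male')

print(conditioned_rate, unconditioned_rate)
```

If conditional sampling works as intended, the conditioned rate should be clearly higher than the unconditioned one, which reflects the empirical frequency of the value.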