keyerror with higher degrees #21

Open
lpkoh opened this issue May 10, 2020 · 11 comments
@lpkoh

lpkoh commented May 10, 2020

Hi,

Thank you so much for this! It's been a lifesaver. I got your model to run on one of my datasets, but I ran into a problem with higher degrees. With k = 2 and k = 3 on my dataset, the code ran without bugs at several epsilons up to 2.5, but with k = 4 and higher, for all epsilons, the output only gets as far as:

================ Constructing Bayesian Network (BN) ================
Adding ROOT accrued_holidays
Adding attribute org
Adding attribute office
Adding attribute start_date
Adding attribute bonus
Adding attribute birth_date
Adding attribute salary
Adding attribute title
Adding attribute gender
========================== BN constructed ==========================
But then the cell just freezes there until KeyError: (6, 5, 0, 0) occurs.

@haoyueping
Collaborator

Hi, thanks for your feedback. Regarding the KeyError, can you provide more information? For example, does this KeyError still get raised when epsilon = 0? Could you also share the complete error report from the Python interpreter, or any other information that may be helpful?

@lpkoh
Author

lpkoh commented May 12, 2020

Hmmm, very strange. Now when I try to recreate the error, it seems to run. However, it takes several hours to build one degree-4 Bayesian network. Is this normal?

Here is another error I faced while trying to recreate it, with epsilon = 1 and degree of Bayesian network = 5:
```
================ Constructing Bayesian Network (BN) ================
Adding ROOT accrued_holidays
Adding attribute org
Adding attribute office
Adding attribute birth_date
Adding attribute bonus
Adding attribute start_date
Adding attribute salary
Adding attribute title
Adding attribute gender
========================== BN constructed ==========================

TypeError Traceback (most recent call last)
in
6 k=degree_of_bayesian_network,
7 attribute_to_is_categorical=categorical_attributes,
----> 8 attribute_to_is_candidate_key=candidate_keys)
9 describer.save_dataset_description_to_file(description_file + '' +
10 str(epsilon) + '
' +\

~\Desktop\Tonic\CodeAndData\CTGAN_TGAN_PB_tests\DataSynthesizer-master/DataSynthesizer\DataDescriber.py in describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
178 self.data_description['bayesian_network'] = self.bayesian_network
179 self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
--> 180 self.bayesian_network, self.df_encoded, epsilon / 2)
181
182 def read_dataset_from_csv(self, file_name=None):

~\Desktop\Tonic\CodeAndData\CTGAN_TGAN_PB_tests\DataSynthesizer-master/DataSynthesizer\lib\PrivBayes.py in construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon)
271 else:
272 for parents_instance in product(*stats.index.levels[:-1]):
--> 273 dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist()
274 conditional_distributions[child][str(list(parents_instance))] = dist
275

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in getitem(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
1270 def _getitem_tuple(self, tup: Tuple):
1271 try:
-> 1272 return self._getitem_lowerdim(tup)
1273 except IndexingError:
1274 pass

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
1419 return section
1420 # This is an elided recursive call to iloc/loc/etc'
-> 1421 return getattr(section, self.name)[new_key]
1422
1423 raise IndexingError("not applicable")

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in getitem(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
1270 def _getitem_tuple(self, tup: Tuple):
1271 try:
-> 1272 return self._getitem_lowerdim(tup)
1273 except IndexingError:
1274 pass

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
1371 # we may have a nested tuples indexer here
1372 if self._is_nested_tuple_indexer(tup):
-> 1373 return self._getitem_nested_tuple(tup)
1374
1375 # we maybe be using a tuple to represent multiple dimensions here

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_nested_tuple(self, tup)
1451
1452 current_ndim = obj.ndim
-> 1453 obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
1454 axis += 1
1455

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1962
1963 # fall thru to straight lookup
-> 1964 self._validate_key(key, axis)
1965 return self._get_label(key, axis=axis)
1966

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
1829
1830 if not is_list_like_indexer(key):
-> 1831 self._convert_scalar_indexer(key, axis)
1832
1833 def _is_scalar_access(self, key: Tuple) -> bool:

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _convert_scalar_indexer(self, key, axis)
739 ax = self.obj._get_axis(min(axis, self.ndim - 1))
740 # a scalar
--> 741 return ax._convert_scalar_indexer(key, kind=self.name)
742
743 def _convert_slice_indexer(self, key: slice, axis: int):

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexes\base.py in _convert_scalar_indexer(self, key, kind)
2885 elif kind in ["loc"] and is_integer(key):
2886 if not self.holds_integer():
-> 2887 self._invalid_indexer("label", key)
2888
2889 return key

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexes\base.py in _invalid_indexer(self, form, key)
3074 """
3075 raise TypeError(
-> 3076 f"cannot do {form} indexing on {type(self)} with these "
3077 f"indexers [{key}] of {type(key)}"
3078 )

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [5] of <class 'int'>
```

@haoyueping
Collaborator

Hi, the error is traced to normalize_given_distribution in utils.py. Can you modify this function as follows, and let me know the input frequencies value that raises this error?

def normalize_given_distribution(frequencies):
    # numpy is already imported as np at the top of utils.py
    try:
        distribution = np.array(frequencies, dtype=float)
        distribution = distribution.clip(0)  # replace negative values with 0
        summation = distribution.sum()
        if summation > 0:
            if np.isinf(summation):
                return normalize_given_distribution(np.isinf(distribution))
            else:
                return distribution / summation
        else:
            return np.full_like(distribution, 1 / distribution.size)
    except Exception:
        raise Exception(f'An error happens when frequencies={frequencies}')
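
For reference, a quick illustrative check of what the unmodified function computes (the input values here are made up): negative counts are clipped to zero and the rest is normalized to sum to 1.

```python
import numpy as np

# Made-up frequencies for illustration: the -1 is clipped to 0,
# then [2, 0, 0, 3] is divided by its sum (5).
print(normalize_given_distribution([2, 0, -1, 3]))
# -> [0.4 0.  0.  0.6]
```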

@oregonpillow

I'm also getting KeyErrors.

```
================ Constructing Bayesian Network (BN) ================
Adding ROOT workclass
Adding attribute race
Adding attribute sex
Adding attribute education
Adding attribute capital-gain
Adding attribute education-num
Adding attribute marital-status
Adding attribute occupation
Adding attribute relationship
Adding attribute age
Adding attribute fnlwgt
Adding attribute hours-per-week
Adding attribute capital-loss
Adding attribute native-country
Adding attribute income
========================== BN constructed ==========================

KeyError Traceback (most recent call last)
in ()
4 k=degree_of_bayesian_network,
5 attribute_to_is_categorical=categorical_attributes,
----> 6 attribute_to_is_candidate_key=candidate_keys)
7 describer.save_dataset_description_to_file(description_file)

11 frames
/content/gdrive/My Drive/DataSynthesizer/DataSynthesizer/DataDescriber.py in describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
178 self.data_description['bayesian_network'] = self.bayesian_network
179 self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
--> 180 self.bayesian_network, self.df_encoded, epsilon / 2)
181
182 def read_dataset_from_csv(self, file_name=None):

/content/gdrive/My Drive/DataSynthesizer/DataSynthesizer/lib/PrivBayes.py in construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon)
271 else:
272 for parents_instance in product(*stats.index.levels[:-1]):
--> 273 dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist()
274 conditional_distributions[child][str(list(parents_instance))] = dist
275

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in getitem(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1270 def _getitem_tuple(self, tup: Tuple):
1271 try:
-> 1272 return self._getitem_lowerdim(tup)
1273 except IndexingError:
1274 pass

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
1378 # instead of checking it as multiindex representation (GH 13797)
1379 if isinstance(ax0, ABCMultiIndex) and self.name != "iloc":
-> 1380 result = self._handle_lowerdim_multi_index_axis0(tup)
1381 if result is not None:
1382 return result

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _handle_lowerdim_multi_index_axis0(self, tup)
1358 # else IndexingError will be raised
1359 if len(tup) <= self.obj.index.nlevels and len(tup) > self.ndim:
-> 1360 raise ek
1361
1362 return None

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _handle_lowerdim_multi_index_axis0(self, tup)
1350 try:
1351 # fast path for series or for tup devoid of slices
-> 1352 return self._get_label(tup, axis=axis)
1353 except TypeError:
1354 # slices are unhashable

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
623 raise IndexingError("no slices here, handle elsewhere")
624
--> 625 return self.obj._xs(label, axis=axis)
626
627 def _get_loc(self, key: int, axis: int):

/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
3533 index = self.index
3534 if isinstance(index, MultiIndex):
-> 3535 loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
3536 else:
3537 loc = self.index.get_loc(key)

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc_level(self, key, level, drop_level)
2816 raise KeyError(key) from e
2817 else:
-> 2818 return partial_selection(key)
2819 else:
2820 indexer = None

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in partial_selection(key, indexer)
2803 def partial_selection(key, indexer=None):
2804 if indexer is None:
-> 2805 indexer = self.get_loc(key)
2806 ilevels = [
2807 i for i in range(len(key)) if key[i] != slice(None, None)

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2683
2684 if start == stop:
-> 2685 raise KeyError(key)
2686
2687 if not follow_key:

KeyError: (3, 1, 12, 1, 1)
```

@haoyueping
Collaborator

Please check out the latest code (commit 9f476eb), and see if this KeyError is fixed.

@hamzanaeem1999

hamzanaeem1999 commented Mar 25, 2021

@haoyueping How do I choose the value of k for constructing a Bayesian network? My CSV file contains 40 attributes.

@haoyueping
Collaborator

@hamzanaeem1999 In theory, a higher value of k makes the Bayesian network more accurate, while a lower value of k reduces the time and space complexity. In practice, you can start with k = 1 and gradually increase it until you find a suitable k.
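
As a rough sketch of that advice (file names, the attribute dictionaries, and epsilon below are placeholders for your own setup; the DataDescriber calls mirror the ones in the tracebacks above), you could time a run for each k and stop increasing it once the runtime or the quality of the synthetic data is no longer acceptable:

```python
import time
from DataSynthesizer.DataDescriber import DataDescriber

input_csv = 'my_dataset.csv'                # placeholder path
categorical_attributes = {'org': True}      # placeholder overrides
candidate_keys = {'employee_id': True}      # placeholder
epsilon = 1

for k in range(1, 5):  # try k = 1, 2, 3, 4
    describer = DataDescriber(category_threshold=20)
    start = time.time()
    describer.describe_dataset_in_correlated_attribute_mode(
        dataset_file=input_csv,
        epsilon=epsilon,
        k=k,
        attribute_to_is_categorical=categorical_attributes,
        attribute_to_is_candidate_key=candidate_keys)
    describer.save_dataset_description_to_file(f'description_k{k}.json')
    print(f'k={k} finished in {time.time() - start:.1f}s')
    # stop increasing k once runtime or synthetic-data quality stops improving
```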

@hamzanaeem1999

hamzanaeem1999 commented Mar 25, 2021 via email

@haoyueping
Collaborator

@hamzanaeem1999 DataSynthesizer works best for categorical attributes. When it handles numerical values, it uses histograms to model the distribution, so it won't be accurate within each bin of the histogram.

A greater epsilon value corresponds to less noise, so you need to try different epsilon values to find a good tradeoff between privacy and utility.
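
For example, here is a hedged sketch of an epsilon sweep (again with placeholder file names and attribute dictionaries), generating one synthetic dataset per epsilon so they can be compared against the real data:

```python
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

input_csv = 'my_dataset.csv'                # placeholder path
categorical_attributes = {'org': True}      # placeholder
candidate_keys = {'employee_id': True}      # placeholder

for epsilon in (0.1, 0.5, 1, 2, 5):
    describer = DataDescriber(category_threshold=20)
    describer.describe_dataset_in_correlated_attribute_mode(
        dataset_file=input_csv,
        epsilon=epsilon,
        k=2,
        attribute_to_is_categorical=categorical_attributes,
        attribute_to_is_candidate_key=candidate_keys)
    description_file = f'description_eps{epsilon}.json'
    describer.save_dataset_description_to_file(description_file)

    generator = DataGenerator()
    generator.generate_dataset_in_correlated_attribute_mode(1000, description_file)
    generator.save_synthetic_data(f'synthetic_eps{epsilon}.csv')
    # compare each synthetic_eps*.csv against the real data to pick an epsilon
```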

@hamzanaeem1999

hamzanaeem1999 commented Mar 25, 2021 via email

@haoyueping
Collaborator

@hamzanaeem1999 DataSynthesizer identifies categorical attributes via the category_threshold parameter: any attribute whose domain size is smaller than category_threshold is treated as categorical, so you don't need to specify each of those explicitly.
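
A small sketch of that, assuming a placeholder threshold of 20 and a placeholder dataset path:

```python
from DataSynthesizer.DataDescriber import DataDescriber

# With category_threshold=20 (placeholder value), every attribute with fewer
# than 20 distinct values is treated as categorical automatically, so
# attribute_to_is_categorical only needs to list the exceptions.
describer = DataDescriber(category_threshold=20)
describer.describe_dataset_in_correlated_attribute_mode(
    dataset_file='my_dataset.csv',                          # placeholder path
    epsilon=1,
    k=2,
    attribute_to_is_categorical={'native-country': True})   # explicit override only where needed
```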
