keyerror with higher degrees #21

Open
lpkoh opened this issue May 10, 2020 · 11 comments
@lpkoh

lpkoh commented May 10, 2020

Hi,

Thank you so much for this! It's been a lifesaver. I got your model to run on one of my datasets, but I ran into a problem with higher degrees. With k = 2 and k = 3 on my dataset, the code ran without bugs at several epsilons up to 2.5, but with k = 4 and higher, for all epsilons, the output only gets as far as:

================ Constructing Bayesian Network (BN) ================
Adding ROOT accrued_holidays
Adding attribute org
Adding attribute office
Adding attribute start_date
Adding attribute bonus
Adding attribute birth_date
Adding attribute salary
Adding attribute title
Adding attribute gender
========================== BN constructed ==========================
But then the cell just freezes there until KeyError: (6, 5, 0, 0) occurs.

@haoyueping
Collaborator

Hi, thanks for your feedback. Regarding the KeyError, can you provide more information? For example, does this KeyError still get raised when epsilon = 0? Could you also share the complete error report from the Python interpreter, or any other information that may be helpful?

@lpkoh
Author

lpkoh commented May 12, 2020

Hmmm, very strange. Now when I try to recreate the error, it seems to run. However, it takes several hours to build one degree-4 Bayesian network. Is this normal?

Here is another error I faced while trying to recreate it, with epsilon = 1 and degree of Bayesian network = 5:
```
================ Constructing Bayesian Network (BN) ================
Adding ROOT accrued_holidays
Adding attribute org
Adding attribute office
Adding attribute birth_date
Adding attribute bonus
Adding attribute start_date
Adding attribute salary
Adding attribute title
Adding attribute gender
========================== BN constructed ==========================

TypeError Traceback (most recent call last)
in
6 k=degree_of_bayesian_network,
7 attribute_to_is_categorical=categorical_attributes,
----> 8 attribute_to_is_candidate_key=candidate_keys)
9 describer.save_dataset_description_to_file(description_file + '' +
10 str(epsilon) + '
' +\

~\Desktop\Tonic\CodeAndData\CTGAN_TGAN_PB_tests\DataSynthesizer-master/DataSynthesizer\DataDescriber.py in describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
178 self.data_description['bayesian_network'] = self.bayesian_network
179 self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
--> 180 self.bayesian_network, self.df_encoded, epsilon / 2)
181
182 def read_dataset_from_csv(self, file_name=None):

~\Desktop\Tonic\CodeAndData\CTGAN_TGAN_PB_tests\DataSynthesizer-master/DataSynthesizer\lib\PrivBayes.py in construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon)
271 else:
272 for parents_instance in product(*stats.index.levels[:-1]):
--> 273 dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist()
274 conditional_distributions[child][str(list(parents_instance))] = dist
275

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in getitem(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
1270 def _getitem_tuple(self, tup: Tuple):
1271 try:
-> 1272 return self._getitem_lowerdim(tup)
1273 except IndexingError:
1274 pass

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
1419 return section
1420 # This is an elided recursive call to iloc/loc/etc'
-> 1421 return getattr(section, self.name)[new_key]
1422
1423 raise IndexingError("not applicable")

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in getitem(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
1270 def _getitem_tuple(self, tup: Tuple):
1271 try:
-> 1272 return self._getitem_lowerdim(tup)
1273 except IndexingError:
1274 pass

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
1371 # we may have a nested tuples indexer here
1372 if self._is_nested_tuple_indexer(tup):
-> 1373 return self._getitem_nested_tuple(tup)
1374
1375 # we maybe be using a tuple to represent multiple dimensions here

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_nested_tuple(self, tup)
1451
1452 current_ndim = obj.ndim
-> 1453 obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
1454 axis += 1
1455

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1962
1963 # fall thru to straight lookup
-> 1964 self._validate_key(key, axis)
1965 return self._get_label(key, axis=axis)
1966

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
1829
1830 if not is_list_like_indexer(key):
-> 1831 self._convert_scalar_indexer(key, axis)
1832
1833 def _is_scalar_access(self, key: Tuple) -> bool:

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _convert_scalar_indexer(self, key, axis)
739 ax = self.obj._get_axis(min(axis, self.ndim - 1))
740 # a scalar
--> 741 return ax._convert_scalar_indexer(key, kind=self.name)
742
743 def _convert_slice_indexer(self, key: slice, axis: int):

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexes\base.py in _convert_scalar_indexer(self, key, kind)
2885 elif kind in ["loc"] and is_integer(key):
2886 if not self.holds_integer():
-> 2887 self._invalid_indexer("label", key)
2888
2889 return key

~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexes\base.py in _invalid_indexer(self, form, key)
3074 """
3075 raise TypeError(
-> 3076 f"cannot do {form} indexing on {type(self)} with these "
3077 f"indexers [{key}] of {type(key)}"
3078 )

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [5] of <class 'int'>
```

@haoyueping
Collaborator

Hi, the error is traced to normalize_given_distribution in utils.py. Can you modify this function as follows, and let me know the input frequencies value that raises this error?

def normalize_given_distribution(frequencies):
    # numpy is already imported as np at the top of utils.py
    try:
        distribution = np.array(frequencies, dtype=float)
        distribution = distribution.clip(0)  # replace negative values with 0
        summation = distribution.sum()
        if summation > 0:
            if np.isinf(summation):
                return normalize_given_distribution(np.isinf(distribution))
            else:
                return distribution / summation
        else:
            return np.full_like(distribution, 1 / distribution.size)
    except Exception:
        raise Exception(f'An error happens when frequencies={frequencies}')
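
For reference, a quick illustrative check of what the unmodified function computes (the input values here are made up): negative counts are clipped to zero and the rest is normalized to sum to 1.

```python
import numpy as np

# Made-up frequencies for illustration: the -1 is clipped to 0,
# then [2, 0, 0, 3] is divided by its sum (5).
print(normalize_given_distribution([2, 0, -1, 3]))
# -> [0.4 0.  0.  0.6]
```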

@oregonpillow

I'm also getting KeyErrors.

```
================ Constructing Bayesian Network (BN) ================
Adding ROOT workclass
Adding attribute race
Adding attribute sex
Adding attribute education
Adding attribute capital-gain
Adding attribute education-num
Adding attribute marital-status
Adding attribute occupation
Adding attribute relationship
Adding attribute age
Adding attribute fnlwgt
Adding attribute hours-per-week
Adding attribute capital-loss
Adding attribute native-country
Adding attribute income
========================== BN constructed ==========================

KeyError Traceback (most recent call last)
in ()
4 k=degree_of_bayesian_network,
5 attribute_to_is_categorical=categorical_attributes,
----> 6 attribute_to_is_candidate_key=candidate_keys)
7 describer.save_dataset_description_to_file(description_file)

11 frames
/content/gdrive/My Drive/DataSynthesizer/DataSynthesizer/DataDescriber.py in describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
178 self.data_description['bayesian_network'] = self.bayesian_network
179 self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
--> 180 self.bayesian_network, self.df_encoded, epsilon / 2)
181
182 def read_dataset_from_csv(self, file_name=None):

/content/gdrive/My Drive/DataSynthesizer/DataSynthesizer/lib/PrivBayes.py in construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon)
271 else:
272 for parents_instance in product(*stats.index.levels[:-1]):
--> 273 dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist()
274 conditional_distributions[child][str(list(parents_instance))] = dist
275

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in getitem(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1270 def _getitem_tuple(self, tup: Tuple):
1271 try:
-> 1272 return self._getitem_lowerdim(tup)
1273 except IndexingError:
1274 pass

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
1378 # instead of checking it as multiindex representation (GH 13797)
1379 if isinstance(ax0, ABCMultiIndex) and self.name != "iloc":
-> 1380 result = self._handle_lowerdim_multi_index_axis0(tup)
1381 if result is not None:
1382 return result

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _handle_lowerdim_multi_index_axis0(self, tup)
1358 # else IndexingError will be raised
1359 if len(tup) <= self.obj.index.nlevels and len(tup) > self.ndim:
-> 1360 raise ek
1361
1362 return None

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _handle_lowerdim_multi_index_axis0(self, tup)
1350 try:
1351 # fast path for series or for tup devoid of slices
-> 1352 return self._get_label(tup, axis=axis)
1353 except TypeError:
1354 # slices are unhashable

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
623 raise IndexingError("no slices here, handle elsewhere")
624
--> 625 return self.obj._xs(label, axis=axis)
626
627 def _get_loc(self, key: int, axis: int):

/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
3533 index = self.index
3534 if isinstance(index, MultiIndex):
-> 3535 loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
3536 else:
3537 loc = self.index.get_loc(key)

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc_level(self, key, level, drop_level)
2816 raise KeyError(key) from e
2817 else:
-> 2818 return partial_selection(key)
2819 else:
2820 indexer = None

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in partial_selection(key, indexer)
2803 def partial_selection(key, indexer=None):
2804 if indexer is None:
-> 2805 indexer = self.get_loc(key)
2806 ilevels = [
2807 i for i in range(len(key)) if key[i] != slice(None, None)

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
2683
2684 if start == stop:
-> 2685 raise KeyError(key)
2686
2687 if not follow_key:

KeyError: (3, 1, 12, 1, 1)
```

@haoyueping
Collaborator

Please check out the latest code (commit 9f476eb), and see if this KeyError is fixed.

@hamzanaeem1999

hamzanaeem1999 commented Mar 25, 2021

@haoyueping How do I choose the value of k for constructing a Bayesian network? My CSV file contains 40 attributes.

@haoyueping
Collaborator

@hamzanaeem1999 In theory, a higher value of k makes the Bayesian network more accurate, while a lower value of k reduces the time and space complexity. In practice, you can start with k = 1 and gradually increase it until you find a suitable k.
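
As a rough sketch of that advice (file names, the attribute dictionaries, and epsilon below are placeholders for your own setup; the DataDescriber calls mirror the ones in the tracebacks above), you could time a run for each k and stop increasing it once the runtime or the quality of the synthetic data is no longer acceptable:

```python
import time
from DataSynthesizer.DataDescriber import DataDescriber

input_csv = 'my_dataset.csv'                # placeholder path
categorical_attributes = {'org': True}      # placeholder overrides
candidate_keys = {'employee_id': True}      # placeholder
epsilon = 1

for k in range(1, 5):  # try k = 1, 2, 3, 4
    describer = DataDescriber(category_threshold=20)
    start = time.time()
    describer.describe_dataset_in_correlated_attribute_mode(
        dataset_file=input_csv,
        epsilon=epsilon,
        k=k,
        attribute_to_is_categorical=categorical_attributes,
        attribute_to_is_candidate_key=candidate_keys)
    describer.save_dataset_description_to_file(f'description_k{k}.json')
    print(f'k={k} finished in {time.time() - start:.1f}s')
    # stop increasing k once runtime or synthetic-data quality stops improving
```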

@hamzanaeem1999

hamzanaeem1999 commented Mar 25, 2021 via email

@haoyueping
Collaborator

@hamzanaeem1999 DataSynthesizer works best for categorical attributes. When it handles numerical values, it uses histograms to model the distribution, so it won't be accurate within each bin of the histogram.

A greater epsilon value corresponds to less noise, so you need to try different epsilon values to find a good tradeoff between privacy and utility.
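
For example, here is a hedged sketch of an epsilon sweep (again with placeholder file names and attribute dictionaries), generating one synthetic dataset per epsilon so they can be compared against the real data:

```python
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

input_csv = 'my_dataset.csv'                # placeholder path
categorical_attributes = {'org': True}      # placeholder
candidate_keys = {'employee_id': True}      # placeholder

for epsilon in (0.1, 0.5, 1, 2, 5):
    describer = DataDescriber(category_threshold=20)
    describer.describe_dataset_in_correlated_attribute_mode(
        dataset_file=input_csv,
        epsilon=epsilon,
        k=2,
        attribute_to_is_categorical=categorical_attributes,
        attribute_to_is_candidate_key=candidate_keys)
    description_file = f'description_eps{epsilon}.json'
    describer.save_dataset_description_to_file(description_file)

    generator = DataGenerator()
    generator.generate_dataset_in_correlated_attribute_mode(1000, description_file)
    generator.save_synthetic_data(f'synthetic_eps{epsilon}.csv')
    # compare each synthetic_eps*.csv against the real data to pick an epsilon
```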

@hamzanaeem1999

hamzanaeem1999 commented Mar 25, 2021 via email

@haoyueping
Collaborator

@hamzanaeem1999 DataSynthesizer identifies categorical attributes via the category_threshold parameter: any attribute whose domain size is smaller than category_threshold is treated as categorical, so you don't need to specify each of those explicitly.
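
A small sketch of that, assuming a placeholder threshold of 20 and a placeholder dataset path:

```python
from DataSynthesizer.DataDescriber import DataDescriber

# With category_threshold=20 (placeholder value), every attribute with fewer
# than 20 distinct values is treated as categorical automatically, so
# attribute_to_is_categorical only needs to list the exceptions.
describer = DataDescriber(category_threshold=20)
describer.describe_dataset_in_correlated_attribute_mode(
    dataset_file='my_dataset.csv',                          # placeholder path
    epsilon=1,
    k=2,
    attribute_to_is_categorical={'native-country': True})   # explicit override only where needed
```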
