Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproduction error on hackett.ipynb #10

Open
mcnaughtonadm opened this issue Nov 21, 2023 · 0 comments
Open

Reproduction error on hackett.ipynb #10

mcnaughtonadm opened this issue Nov 21, 2023 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@mcnaughtonadm
Copy link
Collaborator

In the exercise of re-implementing the EMLL example notebooks with PyTensor, we encountered a small bug in reproduction.

The cell:

with sns.plotting_context('notebook'):

    fig = plt.figure(figsize=(40,8))
    ax = fig.add_subplot(111, aspect='equal', adjustable='box')

sns.heatmap(fcc_med_measured[fcc_consistent_measured].values,
            center=0, robust=True, cmap='RdBu', cbar=True, rasterized=True, lw=1)

_= ax.set_yticks(np.arange(sorted_corr_df.shape[0]) + 0.5)
_= ax.set_xticks(np.arange(sorted_corr_df.shape[1]) + 0.5)
_ = ax.set_yticklabels(sorted_corr_df.index, rotation=0)
_ = ax.set_xticklabels(sorted_corr_df.columns)

ax.set_xlabel('Enzyme')
ax.set_ylabel('Boundary Flux')

plt.tight_layout()

yields the figure:
image

whereas the original PyMC3 & Theano implementation yields the figure:
image

The data populating this figure comes from the code snippet:

fccs_hpd = pm.hpd(fccs)
fcc_consistent = np.sign(fccs_hpd[:, :, 0]) == np.sign(fccs_hpd[:, :, 1])
fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)

sorted_corr_df = sort_df(corr_df)

fcc_med_measured = fcc_med.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)
fcc_consistent_measured = fcc_consistent_df.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)

which currently takes the fccs with a shape of (500,240,240) - which is the same in both new and old versions - and calculates the highest posterior density using pymc3.hpd(fccs) in the original implementation but pymc.hdi(fccs) for highest density interval in the new implementation. These methods are in all respects the same at this level of use, but replacing pm.hpd(fccs) with pm.hdi(fccs) in the above snippet returns the error:

/Users/mcna892/mambaforge/envs/emll_dev/lib/python3.11/site-packages/arviz/data/base.py:221: UserWarning: More chains (500) than draws (240). Passed array should have shape (chains, draws, *shape)
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[298], line 4
      2 fccs_hdi_repeated = np.repeat(fccs_hpd[np.newaxis, ...], 500, axis=0)
      3 fcc_consistent = np.sign(fccs_hdi_repeated[:, :, 0]) == np.sign(fccs_hdi_repeated[:, :, 1])
----> 4 fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)
      6 sorted_corr_df = sort_df(corr_df)
      8 fcc_med_measured = fcc_med.reindex(
      9     columns=sorted_corr_df.columns, index=sorted_corr_df.index)

File ~/mambaforge/envs/emll_dev/lib/python3.11/site-packages/pandas/core/frame.py:782, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    771         mgr = dict_to_mgr(
    772             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    773             # attribute "name"
   (...)
    779             copy=_copy,
    780         )
    781     else:
--> 782         mgr = ndarray_to_mgr(
    783             data,
    784             index,
    785             columns,
    786             dtype=dtype,
    787             copy=copy,
    788             typ=manager,
    789         )
    791 # For data is list-like, or Iterable (will consume into list)
    792 elif is_list_like(data):

File ~/mambaforge/envs/emll_dev/lib/python3.11/site-packages/pandas/core/internals/construction.py:336, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    331 # _prep_ndarraylike ensures that values.ndim == 2 at this point
    332 index, columns = _get_axes(
    333     values.shape[0], values.shape[1], index=index, columns=columns
    334 )
--> 336 _check_values_indices_shape_match(values, index, columns)
    338 if typ == "array":
    339     if issubclass(values.dtype.type, str):

File ~/mambaforge/envs/emll_dev/lib/python3.11/site-packages/pandas/core/internals/construction.py:420, in _check_values_indices_shape_match(values, index, columns)
    418 passed = values.shape
    419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (500, 240), indices imply (240, 240)

The problem stems from

fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)

as using r_labels for both columns and index implies a 240 by 240 dataframe, but the data in fcc_consistent is in the shape (500,240) from the sample trace used to calculate this. I tried to force a workaround which produced the incorrect image with the following:

# Initialize an empty array to store HDI results
fccs_hdi = np.zeros((240, 240, 2))

# Calculate HDI for each slice along the first dimension
for i in range(fccs.shape[1]):
    for j in range(fccs.shape[2]):
        hdi = pm.hdi(fccs[:, i, j])  # Compute HDI for each element (across the 500 dimension)
        fccs_hdi[i, j] = hdi  # Store HDI results

fcc_consistent = np.sign(fccs_hdi[:, :, 0]) == np.sign(fccs_hdi[:, :, 1])
fcc_consistent_df = pd.DataFrame(fcc_consistent, columns=r_labels, index=r_labels)

sorted_corr_df = sort_df(corr_df)

fcc_med_measured = fcc_med.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)
fcc_consistent_measured = fcc_consistent_df.reindex(
    columns=sorted_corr_df.columns, index=sorted_corr_df.index)

but this is clearly incorrect. This issue needs to revisit the old implementation with PyMC3 to see what is going on and how we can extend this to using the new data structures present in the latest versions of PyMC with PyTensor.

@mcnaughtonadm mcnaughtonadm added the bug Something isn't working label Nov 21, 2023
@mcnaughtonadm mcnaughtonadm self-assigned this Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant