Unable to use emll to get correct elasticity values on a model with ground truth values #24

janisshin opened this issue Jul 2, 2024 · 14 comments
@janisshin
Collaborator

I've tried using the improved emll version on a modified yeast model to estimate elasticity values and compare them to ground-truth values. Here is the notebook with my work: notebook link. The ELBO convergence plot in my notebook does not look great. Am I using emll correctly, and is this the closest that emll can get to the ground-truth values?

Here is the model referenced in the notebook and the data.
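For reference, the ELBO plot in the notebook comes from something like the following sketch (the toy model here is a placeholder for the actual pymc_model assembled in the notebook):

import pymc as pm
import matplotlib.pyplot as plt

with pm.Model() as pymc_model:  # placeholder for the notebook's real model
    x = pm.Normal("x", mu=0.0, sigma=1.0)

# fit with ADVI and plot the loss history; the curve should flatten
# out as the approximation converges
with pymc_model:
    approx = pm.fit(n=10000, method="advi")

plt.plot(approx.hist)
plt.xlabel("iteration")
plt.ylabel("-ELBO")
plt.show()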

@janisshin
Collaborator Author

@mcnaughtonadm do you have any updates?

@mcnaughtonadm
Collaborator

I do have some updates, @janisshin. Sorry, I thought I'd have a bit more time to test a few different solutions to your problem, but an annoying paper revision took up some extra time 🙄.

So I went through and tried to run your code the way I would run the emll workflow. I'm not sure how things work with tellurium and your SBML model, but it seems like there may be issues stemming from how you build up the requisite components for the pymc_model and how they translate to the inference.

Things like defining Ey_t as a second pm.Deterministic variable, or your HalfNormal error priors, may be adding complexity to the inference, causing weird artifacts early on and leading to the funky ADVI behavior. So I'd like to get to the point right before the pymc_model is built.

I tried to get to the model-building step, but ran into some problems doing it my way versus how tellurium iterates through the data and constructs things like the steady state and boundaries. Do you think you could cloudpickle the variables leading up to your pymc_model construction? Something like:

import cloudpickle
import gzip

# dump every variable used to build the pymc_model into one compressed file
# (drop the `self.` prefixes if these live in a notebook rather than a class)
with gzip.open("pymc_data.pgz", "wb") as f:
    cloudpickle.dump(
        {
            "vn": self.vn,
            "en": self.en,
            "xn": self.xn,
            "v_inds": self.v_inds,
            "ll": self.ll,
            "v_star": self.v_star,
            # ... plus whatever else you use before inference
        },
        f,
    )

Just include the variables you end up using to build the pymc_model before running inference, and place the file somewhere in the repo you shared the other files from. I can't recreate parts of this from your SBML and data alone, since they lack some of the information needed to create certain variables for emll without making assumptions that may not be correct.
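On my end, loading it back would look something like this (same assumptions as the snippet above):

import cloudpickle
import gzip

# load the pickled variables back for local testing
with gzip.open("pymc_data.pgz", "rb") as f:
    data = cloudpickle.load(f)

vn, en, xn = data["vn"], data["en"], data["xn"]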

emll is unfortunately still pretty temperamental about how data is passed into the linlog model and, subsequently, how that linlog model is used. That said, I think this could be a pymc problem, which would be easier to address.

@janisshin
Collaborator Author

Hi Andrew,
Here is the file you requested.

@janisshin
Collaborator Author

Checking in again to see if you've got any updates or need anything else from me.

@mcnaughtonadm
Collaborator

I was able to run emll with your data. I wanted to test running things with the normal Hackett structure for establishing the pymc_model parameters, as found in run_hackett_inference.py, and got a slightly more "expected" ELBO plot:
[screenshot: ELBO convergence plot with the Hackett-style model]

This used the same fit() parameters you ran in your notebook.

I also tried running the pymc_model the way you constructed it in your notebook and was able to replicate what you were seeing:
[screenshot: ELBO plot reproducing the behavior from the notebook]

Something I noticed that could be causing this weird behavior is your use of HalfNormal distributions as the sigma parameter of the Normal likelihoods. If we remove those, we start to see the more expected-looking shape (not perfect, but it doesn't exhibit the weird twisting):

[screenshot: ELBO plot with the HalfNormal sigma removed]

We typically use a fixed sigma, but I don't have a good reason why you couldn't use a HalfNormal there. My recommendation would be to make sure the error is either a correctly defined distribution or a fixed float value.
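To illustrate the difference, here's a minimal sketch of the two likelihood variants (v_hat and v_obs are placeholder names, not the ones from your notebook):

import numpy as np
import pymc as pm

v_obs = np.random.randn(10)  # placeholder observed data

# fixed sigma: the form we typically use
with pm.Model() as fixed_sigma_model:
    v_hat = pm.Normal("v_hat", mu=0.0, sigma=1.0, shape=10)
    pm.Normal("v_like", mu=v_hat, sigma=0.1, observed=v_obs)

# HalfNormal sigma: an extra latent variable for ADVI to fit
with pm.Model() as halfnormal_sigma_model:
    v_hat = pm.Normal("v_hat", mu=0.0, sigma=1.0, shape=10)
    v_err = pm.HalfNormal("v_err", sigma=0.1)
    pm.Normal("v_like", mu=v_hat, sigma=v_err, observed=v_obs)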

We'd need to do some more testing on our end to see if this type of HalfNormal error prior is usable, probably with the hackett example.

@janisshin
Collaborator Author

@mcnaughtonadm,
I'm running another benchmarking test for BMCA using emll on a larger E. coli model, and I'm finding that the vast majority of predicted elasticity values do not match the ground-truth elasticity values. Is this normal?

Here is the graph I'm getting:
[screenshot: predicted vs. ground-truth elasticity values]
And here is the notebook that I used to make the graph.

I'm running the notebook again to check that the predicted flux values and predicted chi values match the ground-truth values. I hope to update the notebook with these graphs by tomorrow morning.

@djinnome I just wanted to give you a heads up on this discrepancy before presenting it at our Wednesday meeting.

@djinnome
Collaborator

djinnome commented Sep 23, 2024

How are you computing the predicted elasticity values? What Bayesian MCA returns is a posterior distribution over each elasticity.

One option is to compute a credible interval for each elasticity and see whether the ground-truth elasticity falls within it.

Another option is to compute the mean absolute error between the prior elasticities and the ground-truth elasticities and compare it with the mean absolute error between the posterior and ground-truth elasticities, to see whether the model learned anything from the data.

Lastly, you should check whether the ranking of the predicted FCCs is close to the ranking of the ground-truth FCCs. If it is, even though the elasticities themselves disagree, that indicates there could be parameter identifiability issues.
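For example, with arviz the first two checks could look roughly like this (the variable name "Ex" and the arrays here are placeholders; substitute your real trace and ground-truth values):

import arviz as az
import numpy as np

trace = az.from_dict(posterior={"Ex": np.random.randn(4, 500, 12)})  # placeholder trace
e_true = np.random.randn(12)  # placeholder ground-truth elasticities

# posterior summary with a 95% highest-density interval per elasticity
summ = az.summary(trace, var_names=["Ex"], hdi_prob=0.95)
lower, upper = summ["hdi_2.5%"].values, summ["hdi_97.5%"].values

# fraction of ground-truth elasticities inside their credible interval
coverage = np.mean((lower <= e_true) & (e_true <= upper))

# mean absolute error of the posterior means against ground truth
mae_post = np.abs(summ["mean"].values - e_true).mean()
print(f"coverage: {coverage:.2f}, posterior MAE: {mae_post:.3f}")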

@janisshin
Collaborator Author

> How are you computing the predicted elasticity values? What Bayesian MCA returns is a posterior distribution over each elasticity.
> One option is to compute a credible interval for each elasticity and see whether the ground-truth elasticity falls within it.
> Another option is to compute the mean absolute error between the prior elasticities and the ground-truth elasticities and compare it with the mean absolute error between the posterior and ground-truth elasticities, to see whether the model learned anything from the data.

I'm using arviz.summary(trace)['mean'].

> Lastly, you should check whether the ranking of the predicted FCCs is close to the ranking of the ground-truth FCCs. If it is, that indicates there could be parameter identifiability issues.

It's somewhat close. Of the top 10 ground-truth FCC values, BMCA predicts about 6-7 correctly. Is this enough information, or would it help if I also made a parity plot for the FCC rankings?
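For reference, I'm counting the overlap roughly like this (a sketch; fcc_true and fcc_pred are placeholders for the real point estimates):

import numpy as np
from scipy.stats import spearmanr

fcc_true = np.random.randn(30)  # placeholder: ground-truth FCCs
fcc_pred = np.random.randn(30)  # placeholder: posterior-mean FCCs

# rank correlation between predicted and ground-truth FCCs
rho, pval = spearmanr(fcc_true, fcc_pred)

# overlap between the top-10 sets by absolute FCC magnitude
top10_true = set(np.argsort(np.abs(fcc_true))[-10:])
top10_pred = set(np.argsort(np.abs(fcc_pred))[-10:])
print(f"Spearman rho: {rho:.2f}, top-10 overlap: {len(top10_true & top10_pred)}")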

I'm not sure what parameter identifiability issues are, but I'm happy to schedule a meeting to talk about it if necessary.

@janisshin
Collaborator Author

@djinnome ,

I've made the graphs you requested for my benchmarking project, but I have some follow-up questions, which are at the bottom.

> Try 100%, 50%, and 10% metabolite coverage

100% internal metabolite data
50% internal metabolite data
10% internal metabolite data

> Plot error of elasticities and FCCs of prior distribution vs ground truth

For elasticities, Ctrl+F "Elasticities, comparison of prior and post" in the notebooks posted above to find the graphs.

> Compare FCC rankings of union of top 10 ground truth, 0.5x, 1.5x, and prior FCCs

ctrl+F "top 15 table" in the notebooks posted above to find the tables showing the top 15 ranked FCCs, ground truth and predicted.

Table columns:

  • 0.5x-pr: 0.5x perturbation to all enzymes, prediction made on prior distribution
  • 1.5x-pr: 1.5x perturbation to all enzymes, prediction made on prior distribution
  • 0.5x-post: 0.5x perturbation to all enzymes, prediction made on posterior distribution
  • 1.5x-post: 1.5x perturbation to all enzymes, prediction made on posterior distribution

For the number of correctly predicted reactions in the top 10 ranked FCCs, Ctrl+F "number of correct predictions of top 10 FCC values".

In the cell immediately after, you'll find the ground-truth rankings of the top 10 predicted FCCs that are not in the ground-truth top 10. I know you asked for the union, but I think the set difference is more useful for seeing this data.

> Try Eflux2 vs real flux data in terms of FCC rankings.

I will aim to have this done later today.

@janisshin
Collaborator Author

@djinnome ,
I'd like more guidance on how to use Eflux2 with my model, given that I do not have any transcriptomics data.

Also, I have further questions about using emll on the putida model.
Earlier in the year, I was advised to use emll.util.create_Ey_matrix(model) to build an Ey matrix.
I realized that this method relies on model.medium to identify boundary indices.
Does this mean that the end-product metabolite 4aca should be added to the model.medium? Or is it considered an internal metabolite? Or should it be left out of the model altogether?

Originally, I chose the last option, but then I did not know how to incorporate the 4aca concentration data into emll.
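For context, this is how I'm inspecting the boundary handling in cobrapy (a sketch; the model path is a placeholder):

import cobra

model = cobra.io.read_sbml_model("putida_model.xml")  # placeholder path

# model.medium only lists exchange reactions that currently allow uptake (inputs)
print(model.medium)

# model.boundary lists all exchanges, demands, and sinks, including secretion
for rxn in model.boundary:
    print(rxn.id, rxn.reaction)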

Can we (or @mcnaughtonadm, whoever is more appropriate) meet to discuss this?

@djinnome
Collaborator

djinnome commented Oct 1, 2024

Hey @janisshin, do you have proteomics data? If so, that is actually a better input for Eflux2 than transcriptomics data, but the normalization might need to be a bit different, since proteomics has different distributional qualities than transcriptomics.

@janisshin
Collaborator Author

janisshin commented Oct 1, 2024

I do not have proteomics data either. Should I go find some on the internet?

I found this one, but it seems I need to process the data.

@djinnome
Collaborator

djinnome commented Oct 1, 2024

For the benchmark examples, I think you can just make up the proteomics data from the ground-truth model, yes?

With regard to real proteomics data, it would need to come from a strain that produces your end product (4ACA in this case).

You don't need to include 4ACA in the model.medium (which assumes inputs only), but your Ey matrix should contain external 4ACA. Also, when you run Eflux2, you should definitely make sure you are maximizing your end product subject to a minimal growth-rate constraint. We can talk more about this tomorrow.
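Something along these lines in cobrapy (the path and reaction IDs are placeholders; substitute the real ones from your model):

import cobra

model = cobra.io.read_sbml_model("putida_model.xml")  # placeholder path

# enforce a minimal growth rate before optimizing for the end product
biomass = model.reactions.get_by_id("BIOMASS")  # placeholder biomass ID
biomass.lower_bound = 0.05                      # minimal growth constraint

# maximize secretion of the end product via its exchange reaction
model.objective = "EX_4aca_e"                   # placeholder exchange ID
solution = model.optimize()
print(solution.objective_value)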

@janisshin
Collaborator Author

We don't have a Biosystems meeting tomorrow, but I'm happy to meet with you at whatever time is most convenient for you. Please let me know when that would be.
