Proposal/Brainstorm: Using DNA diffusion models for drug discovery #11

aaronwtr · 2022-10-15T00:15:12Z

aaronwtr
Oct 15, 2022
Collaborator

Drug discovery background

Conventional drug discovery relies mostly on time- and resource-intensive experimental procedures that often cost billions of dollars, take years to complete, and suffer from extraordinarily high attrition rates of ~90%. Computational methods are therefore increasingly being integrated into the drug discovery pipeline in the hope of improving its speed and efficiency. Computational models can assist at various stages of the pipeline, e.g. drug target identification, drug target validation, and toxicity prediction. A lot of research and development is currently going into computational drug discovery methods, so this area is highly relevant at the moment. OpenBioML also has a project that focuses on drug discovery, so this proposal could facilitate collaboration between the two projects.

Where does DNA diffusion fit in?

If successful, DNA diffusion models will provide SOTA means to generate novel DNA sequences. By leveraging guided diffusion, the generated sequences can be designed even for a particular, single target gene. Moreover, a diffusion model that takes in guiding information via a prompt can be used by researchers beyond just ML-focused computational biologists due to the ease of use of a natural language interface. Altogether, this means that the DNA diffusion architecture provides an alluring avenue for computational drug discovery.

My current thinking on how we might use DNA diffusion for drug discovery is to prompt the model to design a sequence that can modulate gene expression, or restore gene expression modulation for mutated regulatory regions. This can be achieved by having the generated sequence, or some downstream product of that sequence such as a transcription factor, interact with the regulatory region of a gene that is associated with a diseased phenotype. To start off, I propose that the diseased phenotype be monogenic, as this would be the simplest proof-of-concept case.

Case-study: Familial hypercholesterolemia

Let me further clarify how I imagine this would ideally work with an example of a monogenic disease caused by an inherited defect in the promoter region of the LDLR gene. There will be quite a bit of biological terminology here, so if there is anything unclear, please don't hesitate to ask for clarification.

Familial hypercholesterolemia (FH) is an autosomal dominant disorder, which in humans occurs when one of the gametes involved in fertilization carries a mutated copy of the LDLR gene. People affected by FH are prone to develop atherosclerosis, i.e. a build-up of plaque on the inner lining of an artery. This, in turn, can result in heart attacks, strokes or aneurysms, which are often fatal. Atherosclerosis is the leading cause of cardiovascular disease, which is the leading cause of death worldwide with almost 18 million deaths in 2019 alone, representing 32% of the deaths for that year. Of those deaths around 200,000 are currently attributed to FH.

Before we can appreciate how DNA diffusion might assist in finding a drug for this disease, we must first consider the genetic basis of FH. LDLR is a gene located on chromosome 19 that under normal conditions is expressed as the low density lipoprotein receptor (LDLR). The LDLR gene family consists of cell surface proteins involved in receptor-mediated uptake of specific molecules by the cell. In particular, "bad" cholesterol (LDL) is the predominant ligand interacting with LDLR. After uptake, the cholesterol can be used, stored or degraded. However, when LDLR does not function properly, the cholesterol cannot enter the cell and remains in circulation. When the concentration of LDL in the blood gets too high, LDL increasingly accumulates at the inner artery lining, leading to atherosclerosis. Impairment of LDLR function can happen in different ways. For example, a mutation in an exon of LDLR can cause its protein product to misfold and no longer function properly. While mutations along the entire LDLR gene can cause FH, mutations in the promoter region of LDLR are thought to cause more severe clinical phenotypes.

Mutations in the promoter region of LDLR lead to a significant reduction in LDLR expression. One transcription factor (TF) in particular was found to be implicated in FH: sterol regulatory element-binding protein (SREBP). This TF binds to the LDLR promoter region in the absence of LDL in order to stimulate LDLR expression and hence uptake of extracellular LDL. As you might imagine, failure of SREBP to bind reliably to the promoter results in a decreased ability of the cell to take in and process extracellular LDL, leading to the diseased phenotype: atherosclerosis.

So how can we leverage DNA diffusion models to potentially recover LDLR function in patients whose SREBP binding region of the promoter is mutated? What we would need is a protein that has broadly the same structure as SREBP, with a slight adjustment in the promoter binding site so that the binding affinity between the generated SREBP (gen-SREBP) and the LDLR promoter region is restored. In this context, our pipeline ideally would look like this:

1. Input prompt into DNA diffusion model: "Generate a transcription factor that is structurally identical to SREBP, except for the promoter binding site, which should be adjusted to bind to {insert mutated promoter sequence}"
2. DNA diffusion generates our DNA sequence from noise
3. We transcribe and translate our sequence in silico
4. Fold the string of generated amino acids into its expected 3D conformation (using for example AlphaFold)
5. Assess binding affinity of gen-SREBP with the diseased promoter region (preferably in silico as well, although I am still searching for any existing tools that can quantify TF-binding site affinity computationally)
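The in silico transcription and translation step could look roughly like the sketch below. It is only an illustration under some loud assumptions: the generated sequence is treated as an intron-free coding strand, translation starts at the first ATG and stops at the first in-frame stop codon, and the standard genetic code is used. The function names `transcribe` and `translate` are made up for this example and are not part of any existing DNA diffusion code.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA -> mRNA (assumes no introns, so T is simply replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first in-frame stop codon."""
    dna = mrna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found, so no protein product
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks a codon with an unknown base
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTGA"))` gives the two-residue peptide `MA`. The resulting amino acid string is what would be handed to the folding step.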

Rinse and repeat. This can of course be done in parallel and some prompt engineering might be needed to get the model to understand exactly what we want, but either way, this in silico pipeline would be a lot more tractable than trying to manually improve the SREBP protein and validating experimentally with ChIP-seq. In the end, we would of course still want to do this experiment, but by using the computational method outlined here, we can do this in an informed way rather than a naive trial-and-error process.
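The "rinse and repeat, in parallel" part could be organised as a simple generate-score-rank loop. The sketch below is only a skeleton: `generate` and `score` are hypothetical placeholders standing in for the diffusion model and for the whole translate, fold and binding-affinity chain, respectively. Nothing here calls a real model or a real affinity tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def screen_candidates(
    generate: Callable[[str], str],   # hypothetical: prompt -> generated DNA sequence
    score: Callable[[str], float],    # hypothetical: DNA sequence -> predicted affinity
    prompt: str,
    n_candidates: int = 8,
    top_k: int = 3,
    max_workers: int = 4,
) -> List[Tuple[float, str]]:
    """Sample candidates, score the unique ones in parallel, return the top_k."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    unique = sorted(set(candidates))  # avoid scoring the same sequence twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score, unique))
    # Highest score first; ties are broken by sequence so the order is deterministic.
    ranked = sorted(zip(scores, unique), reverse=True)
    return ranked[:top_k]
```

Only the top-ranked sequences from a loop like this would then go on to experimental validation, which is the "informed rather than naive trial-and-error" point made above.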

Concluding thoughts

This idea has been brewing in my head for a few weeks now. I was hesitant to share it because the outline here covers only a small fraction of the challenges and nuances this idea faces. I was especially unsure about the FH example: if the proposed gen-SREBP did function as hoped, it would probably have to be administered in much the same way as current cardiovascular disease medication, so in that sense it is not much of an improvement. The benefit I see over current medicines used to treat atherosclerosis, such as blood thinners or beta blockers, is that this approach treats the cause of the atherosclerosis rather than its symptoms. Whether this will also translate into clinical benefit remains to be seen. Regardless of my doubts, I decided to share it, partly to have experts assess the validity of the idea and partly to inspire other people to think about downstream applications of DNA diffusion (or similar) models, not even necessarily related to drug discovery. Thank you for taking the time to read through this. If you have any questions, suggestions or remarks, feel free to drop a comment. I look forward to any discussion this post might provoke.

lucapinello · 2022-10-15T01:05:22Z

lucapinello
Oct 15, 2022
Maintainer

Very nice writeup! The idea of using diffusion to generate synthetic TFs rather than regulatory sequences is very intriguing. I think the hard part is how to train such a model, given that you will have limited data coupling the DNA sequence encoding a WT or mutated TF with its binding affinity toward a set of sequences. Also, in practical terms, people may argue that if you are able to deliver a new protein to cells (or integrate its encoding DNA sequence), in theory you can integrate a functional copy of the LDLR gene with a functional promoter. I don't want to sound too negative, but I want to provide some points to think about further.

1 reply

aaronwtr Oct 15, 2022
Collaborator Author

Thanks for the feedback! Those are all fair points and food for thought. And let me also just say that I don't take constructive criticism to be negative, so don't worry about that. I think ideas eventually become successful if they stand the test of all scrutiny. So the more scrutiny the better ;)

I agree with the difficulty in training a variant of the DNA diffusion model that can learn TF sequence - regulatory sequence associations. While I'm not thoroughly familiar with the dataset we are currently using, I don't think that one in isolation would be enough. Maybe something like ChEA can be used to curate a dataset that would suit this purpose better?

As for your practical remark, that is something I was toying with as well. What is the actual benefit of doing it this way compared to integrating a functional copy of LDLR into the genome? I think that there are a lot more points of failure associated with trying to introduce a functional copy of the LDLR gene in the liver. For starters, the innate immune system has quite a few defence mechanisms against exogenously introduced DNA that might interfere with introducing the DNA in this way. And even if one can successfully deliver the gene into the nucleus, having just the gene present there is no guarantee that it will produce sufficient LDLR to restore its function. Alternatively, some sort of iPSC procedure where one would integrate the functional gene ex vivo and then transplant the tissue back seems quite cumbersome since the target organ, in this case, is the liver. I know this type of treatment has worked previously to introduce functional gene copies in epidermal and retinal tissues, but I'm not sure this will also easily translate to liver tissue. TF-based therapeutics, on the other hand, might provide a more direct means of achieving the desired result. Currently proposed methods of delivering the TF would be a DART construct or via tethering to protein transduction domains (PTDs). A more recent review of the different currently proposed ways to deliver TFs can be found here.

I'll definitely continue to think about this idea while we are pushing the project forward.
