Very nice writeup! The idea to use diffusion to generate synthetic TFs rather than regulatory sequences is very intriguing. I think the hard part is how to train such a model, given that you will have limited data coupling the DNA sequence encoding a WT or mutated TF with its binding affinity toward a set of sequences. Also, in practical terms, people may argue that if you are able to deliver a new protein to cells (or integrate its encoding DNA sequence), then in theory you could instead integrate a functional copy of the LDLR gene with a functional promoter. I don't want to sound too negative, but I want to provide some points to think about further.
Drug discovery background
Conventional drug discovery methods rely mostly on time- and resource-intensive experimental procedures that often cost billions of dollars, take years to complete, and on top of that suffer from extraordinarily high attrition rates of ~90%. For this reason, computational methods are increasingly being integrated into the drug discovery pipeline in the hope of improving its speed and efficiency. Computational models can assist at various stages of the pipeline, e.g. drug target identification, drug target validation, and toxicity prediction. A lot of research and development is currently going into computational drug discovery methods, so this area of research is highly relevant at the moment. OpenBioML also has a project that focuses on drug discovery, so this proposal could facilitate collaboration between the two projects.
Where does DNA diffusion fit in?
If successful, DNA diffusion models will provide SOTA means to generate novel DNA sequences. By leveraging guided diffusion, the generated sequences can be designed even for a particular, single target gene. Moreover, a diffusion model that takes in guiding information via a prompt can be used by researchers beyond just ML-focused computational biologists due to the ease of use of a natural language interface. Altogether, this means that the DNA diffusion architecture provides an alluring avenue for computational drug discovery.
The way I currently imagine using DNA diffusion in the context of drug discovery is to prompt the model to design a sequence that can modulate gene expression, or restore gene expression modulation for mutated regulatory regions. This can be achieved by having the generated sequence, or some downstream product of that sequence such as a transcription factor, interact with the regulatory region of a gene that is associated with a diseased phenotype. To start off, I propose that the diseased phenotype be monogenic, as this would be the simplest proof-of-concept case.
Case-study: Familial hypercholesterolemia
Let me further clarify how I imagine this would ideally work with an example of a monogenic disease caused by an inherited defect in the promoter region of the LDLR gene. There will be quite a bit of biological terminology here, so if there is anything unclear, please don't hesitate to ask for clarification.
Familial hypercholesterolemia (FH) is an autosomal dominant disorder, which in humans occurs when one of the gametes involved in fertilization carries a mutated copy of the LDLR gene. People affected by FH are prone to develop atherosclerosis, i.e. a build-up of plaque on the inner lining of an artery. This, in turn, can result in heart attacks, strokes or aneurysms, which are often fatal. Atherosclerosis is the leading cause of cardiovascular disease, which is the leading cause of death worldwide with almost 18 million deaths in 2019 alone, representing 32% of the deaths for that year. Of those deaths around 200,000 are currently attributed to FH.
Before we can appreciate how DNA diffusion might assist in finding a drug for this disease, we must first consider the genetic basis of FH. LDLR is a gene located on chromosome 19 that, under normal conditions, is expressed as the low-density lipoprotein receptor (LDLR). The LDLR gene family encodes cell surface proteins involved in receptor-mediated uptake of specific molecules by the cell. In particular, "bad" cholesterol (LDL) is the predominant ligand interacting with LDLR. After uptake, the cholesterol can be used, stored or degraded. However, when LDLR does not function properly, the cholesterol cannot enter the cell and remains in circulation. When the concentration of LDL in the blood gets too high, LDL increasingly accumulates at the inner artery lining, leading to atherosclerosis. Impairment of LDLR function can happen in different ways. For example, a mutation in an exon of LDLR can cause its protein product to misfold and no longer function properly. While mutations along the entire LDLR gene can cause FH, mutations in the promoter region of LDLR are thought to cause more severe clinical phenotypes.
Mutations in the promoter region of LDLR lead to a significant reduction in the expression of LDLR. One transcription factor (TF) in particular was found to be implicated in FH: sterol regulatory element-binding protein (SREBP). This TF binds to the LDLR promoter region in the absence of LDL in order to stimulate LDLR expression and hence uptake of extracellular LDL. As you might imagine, failure of SREBP to bind reliably to the promoter results in a decreased ability of the cell to take in and process extracellular LDL, leading to the diseased phenotype: atherosclerosis.
So how can we leverage DNA diffusion models to potentially recover LDLR function in patients whose SREBP binding region of the promoter is mutated? What we would need is a protein that has broadly the same structure as SREBP, with a slight adjustment in the promoter binding site so that the binding affinity between the generated SREBP (gen-SREBP) and the LDLR promoter region is restored. In this context, our pipeline ideally would look like this:
1. Input prompt into DNA diffusion model: "Generate a transcription factor that is structurally identical to SREBP, except for the promoter binding site, which should be adjusted to bind to {insert mutated promoter sequence}"
2. DNA diffusion generates our DNA sequence from noise
3. We transcribe and translate our sequence in silico
4. Fold the string of generated amino acids into its expected 3D conformation (using for example AlphaFold)
5. Assess binding affinity of gen-SREBP with the diseased promoter region (preferably in silico as well, although I am still searching for any existing tools that can quantify TF - binding site affinity computationally)
6. Rinse and repeat.

This can of course be done in parallel and some prompt engineering might be needed to get the model to understand exactly what we want, but either way, this in silico pipeline would be a lot more tractable than trying to manually improve the SREBP protein and validating experimentally with ChIP-seq. In the end, we would of course still want to do this experiment, but by using the computational method outlined here, we can do this in an informed way rather than a naive trial-and-error process.
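To make the transcribe-and-translate step of the pipeline above a bit more concrete, here is a minimal Python sketch. It assumes the generated sequence is a plain coding strand read 5' to 3', uses the standard genetic code, starts at the first ATG and stops at the first in-frame stop codon. The function names `transcribe` and `translate` are my own placeholders, not part of any DNA diffusion codebase, and the sketch ignores introns, reverse-strand reading frames and everything else a real tool would need to handle.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES),
        AMINO_ACIDS,
    )
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate from the first ATG to the first in-frame stop codon.

    Returns an empty string if there is no start codon. Codons that
    contain unknown characters (e.g. N) are rendered as 'X'. A trailing
    incomplete codon is ignored.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    generated = "CCATGAAATGGTGA"  # toy sequence, not a real TF
    print(transcribe(generated))
    print(translate(generated))
```

The resulting amino acid string is what would be handed to the folding step. The folding and binding-affinity steps are deliberately left out here, since I do not want to guess at tool interfaces.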
Concluding thoughts
This idea has been brewing in my head for a few weeks now. I was hesitant to share it because the outline here presents only a small fraction of all the challenges and nuances this idea faces. I was especially hesitant about my example of FH, because if the proposed gen-SREBP did indeed function as hoped, it would probably have to be administered in a similar fashion to current cardiovascular disease medication, so in that sense it is not much of an improvement. The benefit I see in this approach compared to current medicines to treat atherosclerosis, such as blood thinners or beta blockers, is that it treats the cause of the atherosclerosis rather than its symptoms. Whether this will also translate into clinical benefit remains to be seen. Regardless of my doubts, I decided to proceed with it, in part to have the validity of this idea assessed by some experts and also to hopefully inspire other people to think about downstream applications of DNA diffusion (or similar) models, not even necessarily related to drug discovery. Thank you for taking the time to read through this. If you have any questions, suggestions or remarks, feel free to drop a comment. I look forward to any discussion this post might provoke.