Replies: 5 comments 7 replies
-
Hi Tom! No, you've not missed something, and we should update the documentation to make this clear: all expectations-generated variables are currently independent of each other. As a result of you hitting this (common) road bump, I've bumped the priority of addressing this problem in the framework; you may be interested to read this proposal for fixing it (with a lot of further historic, and fun, detail in the linked ticket). In the meantime, there are a few workarounds; @wjchulme will add to this thread on that subject later.
-
I'd like to add that the principles and processes in DataSHIELD are complementary to our approach and we're keen to incorporate them (or something like them). We have a series of provably-safe implementations of various things like crosstabs in the pipeline right now. It would be great to pick your brains about DataSHIELD at some point.
-
Hi Tom

The first point to make is that for the use-case you describe (although I appreciate it's a simplification), it may not be important that the dummy data doesn't properly respect the relationship you expect to see between T2D and BMI.

But yes, there are other occasions where the dummy data isn't enough for code to run successfully (because the models won't converge, or impossible combinations of values cause errors, or whatever), or there are structural relationships in the data that may be difficult to reflect (eg death dates necessarily occur after diagnosis dates; no pregnant 5 year olds). Depending on exactly what you're trying to achieve, there are a few tricks you could apply to get a dataset looking more like you need within the existing constraints of the platform.

One option is to generate multiple study definitions for different populations. In your case this would be a T2D population (using …).

Another option is post-extraction sampling, where you re-sample with replacement from the dummy data and preferentially select those patients that fit your expectations better. This would (asymptotically) achieve the required correlation structure, though it's possibly more trouble than it's worth as you'd need to do the leg-work to define the appropriate sampling weights. You'd need to be sure to drop this step before running on the real data (see below for one quick and dirty option to make this easier).

Another option is to generate the dummy data yourself (see for example here https://nbviewer.jupyter.org/github/opensafely/mv-dummy-data/blob/main/prototype-report.html, with accompanying package here https://github.com/wjchulme/dd4d; or you could try eg simstudy). You'd need to make sure you use the correct dataset when running locally or on the server. You'd also need to manually check that the dummy variables are consistent with the variables implied by the study definition (we're considering ways to make such a check part of the platform).
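To make the post-extraction sampling idea concrete, here's a minimal sketch. The logistic relationship and every coefficient in it are invented purely for illustration, not taken from any real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A stand-in for a dummy extract: BMI and T2D generated independently,
# as the expectations framework currently does.
dummy = pd.DataFrame({
    "bmi": rng.normal(27, 4, size=10_000),
    "t2d": rng.binomial(1, 0.1, size=10_000),
})

# Weight each row by how plausible its T2D status is given its BMI
# under an assumed logistic relationship (coefficients are made up).
logit = -8 + 0.25 * dummy["bmi"]
p = 1 / (1 + np.exp(-logit))
weights = np.where(dummy["t2d"] == 1, p, 1 - p)

# Re-sample with replacement, preferring rows that fit the expectation.
# Remember: this step must be dropped before running on the real data.
resampled = dummy.sample(n=len(dummy), replace=True,
                         weights=weights, random_state=42)
```

The re-sampled data then shows a positive BMI–T2D association where the raw dummy data had none (at the cost of shifting the marginal incidence, which is part of the leg-work mentioned above).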
If your study pipeline needs to be different when you run locally or on the server, you can create an environment variable on your machine, eg ….

We appreciate that none of these approaches are ideal and we have lots of ideas about how to make things better. Keen to hear your thoughts if you have any!
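A minimal sketch of that environment-variable switch — the variable name `RUNNING_LOCALLY` is invented for illustration; any variable you set in your local shell but never on the server would do:

```python
import os

def input_path(env=None):
    """Choose the dataset based on where we're running.
    RUNNING_LOCALLY is a made-up variable name for illustration;
    set it locally (eg `export RUNNING_LOCALLY=1`) but never on
    the server, so server runs always take the real extract."""
    env = os.environ if env is None else env
    if env.get("RUNNING_LOCALLY"):
        return "dummy_data.csv"      # hand-made dummy data, local only
    return "output/input.csv"        # the real extract on the server
```

Any local-only step (like the re-sampling trick above) can be guarded behind the same check, so forgetting to remove it can't affect the real run.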
-
Related to this, I find myself wanting to set a random number generator seed such that the dummy data system generates exactly the same data on repeated runs of a study generation action, i.e. like setting a seed in R. I can see that a seed is set in the tests in …. My impression is that putting

```python
import numpy as np
np.random.seed(1)
```

at the top of a study_definition file basically works - but do correct me if I am wrong.

Ideally I'd like an extra key-value pair, say:

```python
study = StudyDefinition(
    # Configure the expectations framework
    default_expectations={
        "date": {"earliest": "1970-01-01", "latest": latest_date},
        "rate": "uniform",
        "incidence": 0.8,
        "seed": 123456789
    },
    ...
)
```

I'd say that ensuring exact reproducibility of the dummy data system is important, because a user can at least ensure the dummy data will run for a given seed, and users know they have a dummy dataset avoiding perfect separation and other issues that can cause model non-convergence (and then errors in R) for syntactically valid code. And it strikes me this is especially useful when several users are working on a repo together.
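To illustrate the reproducibility point with plain numpy (this is just numpy behaviour, not a claim about OpenSAFELY's internals):

```python
import numpy as np

np.random.seed(123456789)
first = np.random.binomial(1, 0.8, size=10)

np.random.seed(123456789)          # re-seeding restarts the stream
second = np.random.binomial(1, 0.8, size=10)

# The two draws are bit-identical, so anyone sharing the seed
# regenerates exactly the same dummy data.
assert (first == second).all()
```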
-
I am sure you have been making more progress on improving the way the dummy data is generated. I have taken inspiration from OpenSAFELY because I think it might help DataSHIELD users write their code if they have some dummy data. However, I decided to go in the direction suggested by @wjchulme above of using existing packages such as simstudy. This led me to look at synthpop, which I have found gives good results and can be run without the user having to define the expectations themselves. I found it can also capture things in the raw data that you might not think to put in as expectations yourself. I don't know if synthpop will appeal, because its simplicity also means people don't have to think about whether the dummy data they receive makes sense. However, I thought I should mention it on the small chance that this isn't something you have thought about already in your very detailed discussions.

Cheers

Tom
-
I am a beginner user of OpenSAFELY, so please bear with me.
I want to explore how OpenSAFELY might work with epidemiological studies, rather than health records. Currently we use DataSHIELD, which works in a different way: rather than having a manual check on the output of the analysis, it only provides analysis functions that are built to return non-disclosive results. It is always healthy to explore and learn from other solutions!
A toy problem that we might be interested in is exploring the relationship between BMI and incidence of type 2 diabetes (T2D). Therefore our analysis is a linear regression.
I am in the process of building my study definition to generate the dummy data. I can see how I can define things like expectations for the incidence of T2D and the distribution of BMI. However, I can't find an example of how to define my expectation of the relationship between these variables, which would then generate a dataset where I can run the regression and recreate that expectation.
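For concreteness, the kind of dependent dummy data I mean might look like this — the intercept and slope here are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# BMI drawn from a marginal expectation I can already express...
bmi = rng.normal(27, 4, size=n)

# ...but T2D incidence *depending on* BMI, rather than being drawn
# independently (these coefficients are made up for illustration).
p_t2d = 1 / (1 + np.exp(-(-8 + 0.25 * bmi)))
t2d = rng.binomial(1, p_t2d)
```

Running the regression of t2d on bmi against such a dataset would then (roughly) recover the relationship I put in.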
Have I missed something, or is this not possible?
Thanks