Replies: 5 comments 7 replies
-
Hi Tom! No, you've not missed something, and we should update the documentation to make this clear: all expectations-generated variables are currently independent of each other. As a result of you hitting this (common) road bump, I've bumped the priority of addressing this problem in the framework; you may be interested to read this proposal for fixing it (with a lot of further historic, and fun, detail in the linked ticket). In the meantime, there are a few workarounds; @wjchulme will add to this thread on that subject later.
-
I'd like to add that the principles and processes in DataSHIELD are complementary to our approach and we're keen to incorporate them (or something like them). We have a series of provably-safe implementations of various things like crosstabs in the pipeline right now. It would be great to pick your brains about DataSHIELD at some point.
-
Hi Tom

The first point to make is that for the use-case you describe (although I appreciate it's a simplification), it may not be important that the dummy data doesn't properly respect the relationship you expect to see between T2D and BMI.

But yes, there are other occasions where the dummy data isn't enough for code to run successfully (because the models won't converge, or impossible combinations of values cause errors, or whatever), or there are structural relationships in the data that may be difficult to reflect (eg death dates necessarily occur after diagnosis dates; no pregnant 5 year olds). Depending on exactly what you're trying to achieve, there are a few tricks you could apply to get a dataset looking more like you need within the existing constraints of the platform.

One option is to generate multiple study definitions for different populations. In your case this would be a T2D population (using …).

Another option is post-extraction sampling, where you re-sample with replacement from the dummy data and preferentially select those patients that fit your expectations better. This would (asymptotically) achieve the required correlation structure, though it's possibly more trouble than it's worth as you'd need to do the leg-work to define the appropriate sampling weights. You'd need to be sure to drop this step before running on the real data (see below for one quick and dirty option to make this easier).

Another option is to generate the dummy data yourself (see for example here https://nbviewer.jupyter.org/github/opensafely/mv-dummy-data/blob/main/prototype-report.html, with accompanying package here https://github.com/wjchulme/dd4d; or you could try eg simstudy). You'd need to make sure you use the correct dataset when running locally or on the server. You'd also need to manually check that the dummy variables are consistent with the variables implied by the study definition (we're considering ways to make such a check part of the platform).
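To make the post-extraction sampling idea concrete, here's a minimal sketch. The logistic relationship and every coefficient in it are invented purely for illustration, not taken from any real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A stand-in for a dummy extract: BMI and T2D generated independently,
# as the expectations framework currently does.
dummy = pd.DataFrame({
    "bmi": rng.normal(27, 4, size=10_000),
    "t2d": rng.binomial(1, 0.1, size=10_000),
})

# Weight each row by how plausible its T2D status is given its BMI
# under an assumed logistic relationship (coefficients are made up).
logit = -8 + 0.25 * dummy["bmi"]
p = 1 / (1 + np.exp(-logit))
weights = np.where(dummy["t2d"] == 1, p, 1 - p)

# Re-sample with replacement, preferring rows that fit the expectation.
# Remember: this step must be dropped before running on the real data.
resampled = dummy.sample(n=len(dummy), replace=True,
                         weights=weights, random_state=42)
```

The re-sampled data then shows a positive BMI–T2D association where the raw dummy data had none (at the cost of shifting the marginal incidence, which is part of the leg-work mentioned above).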
If your study pipeline needs to be different when you run locally or on the server, you can create an environment variable on your machine, eg ….

We appreciate that none of these approaches are ideal and we have lots of ideas about how to make things better. Keen to hear your thoughts if you have any!
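A minimal sketch of that environment-variable switch — the variable name `RUNNING_LOCALLY` is invented for illustration; any variable you set in your local shell but never on the server would do:

```python
import os

def input_path(env=None):
    """Choose the dataset based on where we're running.
    RUNNING_LOCALLY is a made-up variable name for illustration;
    set it locally (eg `export RUNNING_LOCALLY=1`) but never on
    the server, so server runs always take the real extract."""
    env = os.environ if env is None else env
    if env.get("RUNNING_LOCALLY"):
        return "dummy_data.csv"      # hand-made dummy data, local only
    return "output/input.csv"        # the real extract on the server
```

Any local-only step (like the re-sampling trick above) can be guarded behind the same check, so forgetting to remove it can't affect the real run.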
-
Related to this, I find myself wanting to set a random number generator seed such that the dummy data system generates exactly the same data on repeated runs of a study generation action, i.e. like setting a seed in R. I can see that a seed is set in the tests in …. My impression is that putting

```python
import numpy as np
np.random.seed(1)
```

at the top of a study_definition file basically works - but do correct me if I am wrong.

Ideally I'd like an extra key-value pair, say:

```python
study = StudyDefinition(
    # Configure the expectations framework
    default_expectations={
        "date": {"earliest": "1970-01-01", "latest": latest_date},
        "rate": "uniform",
        "incidence": 0.8,
        "seed": 123456789
    },
    ...
)
```

I'd say that ensuring exact reproducibility of the dummy data system is important, because a user can at least ensure the dummy data will run for a given seed, and users know they have a dummy dataset avoiding perfect separation and other issues that can cause model non-convergence (and then errors in R) for syntactically valid code. And it strikes me this is especially useful when several users are working on a repo together.
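To illustrate the reproducibility point with plain numpy (this is just numpy behaviour, not a claim about OpenSAFELY's internals):

```python
import numpy as np

np.random.seed(123456789)
first = np.random.binomial(1, 0.8, size=10)

np.random.seed(123456789)          # re-seeding restarts the stream
second = np.random.binomial(1, 0.8, size=10)

# The two draws are bit-identical, so anyone sharing the seed
# regenerates exactly the same dummy data.
assert (first == second).all()
```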
-
I am sure you have been making more progress on improving the way the dummy data is generated. I have taken inspiration from OpenSAFELY because I think it might help DataSHIELD users write their code if they have some dummy data. However, I decided to go in the direction suggested by @wjchulme above of using existing packages such as simstudy. This led me to look at synthpop, which I have found gives good results and can be run without the user having to define the expectations themselves. I found it can also capture things in the raw data that you might not think to put in as expectations yourself. I don't know if synthpop will appeal, because its simplicity also means people don't have to think about whether the dummy data they receive makes sense. However, I thought I should mention it on the small chance that this isn't something you have thought about already in your very detailed discussions.

Cheers

Tom
-
I am a beginner user of OpenSAFELY, so please bear with me.
I want to explore how OpenSAFELY might work with epidemiological studies, rather than health records. Currently we use DataSHIELD, which works in a different way: rather than having a manual check on the output of the analysis, it only provides analysis functions that are built to return non-disclosive results. It is always healthy to explore and learn from other solutions!
A toy problem that we might be interested in is exploring the relationship between BMI and incidence of type 2 diabetes (T2D). Therefore our analysis is a linear regression.
I am in the process of building my study definition to generate the dummy data. I can see how I can define things like expectations for the incidence of T2D and the distribution of BMI. However, I can't find an example of how to define my expectation of the relationship between these variables, which would then generate a dataset where I can run the regression and recreate that expectation.
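For concreteness, the kind of dependent dummy data I mean might look like this — the intercept and slope here are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# BMI drawn from a marginal expectation I can already express...
bmi = rng.normal(27, 4, size=n)

# ...but T2D incidence *depending on* BMI, rather than being drawn
# independently (these coefficients are made up for illustration).
p_t2d = 1 / (1 + np.exp(-(-8 + 0.25 * bmi)))
t2d = rng.binomial(1, p_t2d)
```

Running the regression of t2d on bmi against such a dataset would then (roughly) recover the relationship I put in.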
Have I missed something, or is this not possible?
Thanks