This code base generates demographically matched subsets. It was specifically designed for ABCD, but can be used for other data repositories as well. It randomly selects subsets of each ABCD ARM and checks whether they have a statistically significant difference on any of the given demographic variables. Once it finds a subset of each ARM where each does not statistically differ on those variables from the other ARM, it saves each subset pair to a .csv file. It also outputs correlation values between the subsets and the groups, and graph visualizations of those correlations.
For ABCC, see the names and descriptions of the main demographic variables here
automated_subset_analysis.py accepts demographic data for 2 datasets. For each dataset, it randomly selects many subsets which are not significantly different from the other dataset's total demographics.
- Once this script has made subsets of both groups, it finds correlations between the average of each subset and the other group. It also finds the correlation between the averages of both subsets. This process repeats a specified number of times for subsets which include specified numbers of subjects. After finding the correlations, the script saves them to .csv files.
- The correlation values are then plotted as data points on graph visualizations. Those graphs are saved when automated_subset_analysis finishes executing.
For a more detailed explanation, see this document's Explanation of Process section.
Installation should be simple:
- Clone this repository to a location on your local filesystem.
- Run pip install -r requirements.txt from within the new automated_subset_analysis directory.
- To verify that the code is set up, run python3 automated_subset_analysis.py --help within the same directory.
- Python 3.6.8 or greater is required.
- All of the non-default Python packages required by this script are listed in this directory's requirements.txt file.
You must provide 2 csv files (group_1_demo_file and group_2_demo_file), each containing demographic data for all subjects in group 1 and group 2 respectively. The first line of each demographics file should list all of the column names.
- column 1 - should be blank because it is an index/enumerated column
- column 2 - Subject ID
- column 3 - group/ARM
- intermediate columns - demographic variables. By default, the script will assume the csv files contain columns of numerical data under each of these names: demo_comb_income_v2b, demo_ed_v2, demo_prnt_ed_v2b, demo_sex_v2b, ehi_y_ss_scoreb, interview_age, medhx_9a, race_ethnicity, rel_relationship, site_id_l
- final column - list of paths (1 per line) to the .nii files of all subjects in the group
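Before running the script, it can help to confirm that a demographics file has the expected header. The following is a minimal sketch, not part of this repository: the function name missing_columns and the example header (including the subjectkey and arm column names) are assumptions made for illustration.

```python
import csv
import io

# Default demographic column names listed above
DEFAULT_COLUMNS = [
    "demo_comb_income_v2b", "demo_ed_v2", "demo_prnt_ed_v2b", "demo_sex_v2b",
    "ehi_y_ss_scoreb", "interview_age", "medhx_9a", "race_ethnicity",
    "rel_relationship", "site_id_l",
]


def missing_columns(csv_file, required=DEFAULT_COLUMNS):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(csv_file))
    return [name for name in required if name not in header]


# Hypothetical header for illustration only; real files will differ.
example = io.StringIO(",subjectkey,arm,interview_age,site_id_l,pconn10min\n")
print(missing_columns(example))
```

With a real file, pass an open file handle instead of the io.StringIO object. An empty list means every default column is present.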
Example of a basic call to this script:
demo1=/home/user/conan/data/group1_pconn.csv
demo2=/home/user/conan/data/group2_pconn.csv
python3 automated_subset_analysis.py ${demo1} ${demo2}
You will need either
- two averaged matrix .pconn.nii files, one for ARM-1 and another for ARM-2; or
- two .conc files listing .pconn.nii file paths. Each .conc file must list the path to the matrix file of every individual subject in its ARM.
- --group-1-avg-file takes one valid path to a readable .nii file containing the average matrix for the entire group 1. By default, this path will be to the group1_10min_mean.pconn.nii file in this script's parent folder or in the --output folder.
- --group-2-avg-file also takes one valid path to a readable .nii file, just like --group-1-avg-file. By default, it will point to the group2_10min_mean.pconn.nii file in one of the same places.
- --group-1-var-file also takes a valid .nii file path pointing to group 1's total variance matrix. By default, it will point to the group1_variance_matrix.*.nii file in one of the same places.
- --group-2-var-file also takes a valid .nii file path pointing to group 2's total variance matrix. By default, it will point to the group2_variance_matrix.*.nii file in one of the same places.
- --matrices-conc-1 takes one path to a readable .conc file containing only a list of valid paths to group 1 matrix files. This flag is only needed if your group 1 demographics .csv file either does not have a column labeled 'pconn10min' with paths to matrix files, or if it does include that column but you want to use different paths.
- --matrices-conc-2 also takes one path to a readable .conc file. It is just like --matrices-conc-1, but for group 2.
- --output takes one file path to a directory where all files produced by this script will be saved. If the directory already exists, then this script will add files to it and only overwrite files with conflicting filenames. If not, then this script will create a directory at the --output path. If this flag is excluded, then the script will save files to a new subdirectory of the present working directory, ./data/.
- --n-analyses takes one positive integer, the number of times to generate a pair of subsets and analyze them. For every integer given in the --subset-size list, this script will randomly generate --n-analyses subsets, creating (number of --subset-size values) * --n-analyses total .csv files.
- --nan-threshold takes one floating-point number between 0 and 1. If the proportion of rows with Not-a-Number (NaN) values in the data for either group's demographic file is greater than the --nan-threshold, then the script will raise an error. Otherwise, the script will drop every row containing a NaN. The default NaN threshold is 0.1, meaning that if over 10% of rows have a NaN value, the script will stop with an error.
- --subset-size takes one or more positive integers, the number of subjects to include in subsets. Include a list of whole numbers to generate subset pairs of different sizes. By default, the subset sizes will be [50, 100, 200, 300]. An example of entering a different list of sizes is --subset-size 100 300 500 1000.
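The --nan-threshold rule described above can be sketched in a few lines. This is an illustration of the documented behavior under stated assumptions, not the script's actual code: the function name drop_nan_rows and the representation of rows as dictionaries are hypothetical.

```python
import math


def drop_nan_rows(rows, nan_threshold=0.1):
    """Drop rows containing NaN, or raise if too many rows contain one.

    rows is a list of dicts mapping column names to values (an assumption
    made for this sketch).
    """
    def has_nan(row):
        return any(isinstance(v, float) and math.isnan(v)
                   for v in row.values())

    n_bad = sum(has_nan(row) for row in rows)
    if rows and n_bad / len(rows) > nan_threshold:
        raise ValueError(
            f"{n_bad} of {len(rows)} rows contain NaN, which exceeds "
            f"the threshold of {nan_threshold}"
        )
    return [row for row in rows if not has_nan(row)]
```

Note that a file where exactly 10% of rows contain a NaN passes the default threshold, because the error is only raised when the proportion is greater than the threshold.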
- --only-make-graphs takes one or more paths to readable .csv files as a parameter. Include this flag to import average correlations data from .csv files instead of making any new ones. Given this flag, automated_subset_analysis.py only makes graph visualizations of already-existing data.
  - If this flag is included, it must include paths to readable .csv files with 2 columns: Subjects (the number of subjects in each subset) and Correlation (the correlation between each randomly generated subset in that pair).
  - Giving this flag multiple .csv files will put all of their correlations onto one visualization.
- --skip-subset-generation takes either no parameters or one path to a readable directory as a parameter. Include this flag to calculate correlations and create the visualization using existing subsets instead of randomly generating new ones. By default, the subsets to use for calculating the correlations between average matrices and producing a visualization will be assumed to exist in the --output folder. To load subsets from a different folder, add the path to this flag as a parameter.
- If --skip-subset-generation or --only-make-graphs is included, then --subset-size and --n-analyses will do nothing.
- If --only-make-graphs is included, then --skip-subset-generation will do nothing.
- Unless the --only-make-graphs flag is used, the .csv file(s) with subsets' average correlations will/must be called correlations_sub1_sub2.csv, correlations_sub1_all2.csv, and correlations_sub2_all1.csv.
- --fill takes one parameter, a string that is either all or confidence-interval. Include this flag to choose which data to shade in the visualization. Choose all to shade in the area within the minimum and maximum correlations in the dataset. Choose confidence-interval to only shade in the 95% confidence interval of the data. By default, neither will be shaded. This argument cannot be used if --only-make-graphs has multiple parameters.
- --hide-legend takes no parameters. Unless this flag is included, the output visualization(s) will display a legend in the top- or bottom-right corner showing the name of each thing plotted on the graph: data points, average trendline, confidence interval, and/or entire data range.
- --plot takes one or more strings: scatter and/or stdev. By default, a visualization will be made with only the average value for each subset size. Include this flag with the parameter scatter to also plot all data points as a scatter plot, and/or with the parameter stdev to also plot standard deviation bars for each subset size.
- --rounded-scatter takes no parameters. Include this flag to reduce the total number of data points plotted on any scatter-plot visualization by only including points at rounded intervals. This flag does nothing unless --plot includes scatter.
- --axis-font-size takes one positive integer, the font size of the text on both axes of the visualizations that this script will create. If this argument is excluded, then by default, the font size will be 30.
- --graph-title takes one string, the title at the top of all output visualizations and the name of the output .html visualization files. To break the title into two lines, include <br> in the --graph-title string. Unless this flag is included, each visualization will have one of these default titles:
  - "Correlations Between Average Subsets"
  - "Group 1 Subset to Group 2 Correlation"
  - "Group 1 to Group 2 Subset Correlation"
  - "Correlation Between Unknown Groups"
- --marker-size takes one positive integer to determine the size (in pixels) of each data point in the output visualization. The default size is 5.
- --place-legend takes one number between 0 and 1, the location of the legend on the y-axis in the output visualization. 0 is the very bottom of the visualization and 1 is the very top. By default, this value will be 0.05.
- --title-font-size takes one positive integer. It is just like --axis-font-size, except for the title text in the visualizations. This flag determines the size of the title text above the graph as well as both axis labels. If this argument is excluded, then by default, the font size will be 40.
- --trace-titles takes one or more strings. Each will label one dataset in the output visualization. Each should be the title of one of the .csv files given to --only-make-graphs. Include exactly as many titles as there are --only-make-graphs parameters, in exactly the same order as those parameters, to match titles to datasets correctly. This argument only does anything when running the script in --only-make-graphs mode.
- --y-range takes two floating-point numbers, the minimum and maximum values to be displayed on the y-axis of the graph visualizations that this script will create. By default, this script will automatically set the y-axis boundaries to show all of the correlation values and nothing else.
The following arguments only apply when making a visualization using the compiled MATLAB code instead of the Python Plotly code. So, they do nothing unless the --plot-with-matlab argument is included.
- --plot-with-matlab takes one string, a valid path to an existing directory for the MATLAB Runtime Environment v9.4. Include this flag to create the output visualization using compiled MATLAB "MultiShadedBars" code (see src). Otherwise, none of the matlab flags will do anything and the subset analysis code will produce an output visualization using plotly.
- --matlab-lower-bound takes one decimal number between 0 and 1, the lower bound of data to display on the MATLAB output visualization.
- --matlab-no-edge takes no parameters. By default, the output visualization will display an edge. Include this flag to hide that edge.
- --matlab-show takes no parameters. Include this flag to display the threshold as a line on the output visualization. Otherwise, the line will not be shown.
- --matlab-upper-bound takes one decimal number between 0 and 1, the upper bound of data to display on the MATLAB output visualization.
- --matlab-rgba takes 3 to 5 numbers between 0 and 1, the RGBA values and line threshold for producing the visualization. Respectively, those numbers are the red value, green value, blue value, (optional) alpha opacity value, and (optional) threshold at which to draw a line on the visualization.
- --columns takes one or more strings. Each should be the name of a column in the demographics .csv which contains numerical data to include in the subset correlations analysis. By default, the script will assume that both input demographics .csv files have columns of numerical data with these names: demo_comb_income_v2b, demo_ed_v2, demo_prnt_ed_v2b, demo_sex_v2b, ehi_y_ss_scoreb, interview_age, medhx_9a, race_ethnicity, rel_relationship, site_id_l
- --calculate takes one string to define the output metric. With its default value of mean, the subset analysis will calculate correlations between subsets'/groups' average values. Use --calculate variance to correlate the subsets' variances instead, or --calculate effect-size to measure the effect size of the difference between each subset and the total group.
- --inverse-fisher-z takes no parameters. Include this flag to do an inverse Fisher-Z transformation on the matrices imported from the .pconn files of the data before getting correlations.
- --no-matching takes no parameters. Include this flag to match subsets on every demographic variable except family relationships. Otherwise, subsets will be matched on all demographic variables.
  - By default, automated_subset_analysis.py checks that every subset of one group has the same proportion of twins, triplets, or other siblings as the other group. It also checks that no one in the subset has family members outside the subset. --no-matching will skip both checks.
  - Use this flag if --subset-size includes any number under 25, because family relationship matching takes a very long time for small subsets.
- --parallel takes one valid path, the directory containing automated_subset_analysis.py. It will automatically be included by asa_submitter.py to simultaneously run multiple different instances of automated_subset_analysis.py as a batch command. Do not include this flag yourself; it is added automatically when needed.
For more information, including the shorthand flags for each option, run this script with the --help command: python3 automated_subset_analysis.py --help
Generate 50 subsets and save a .csv file of each, including 10 subsets each of sizes 50, 100, 300, 500, and 1000:
python3 automated_subset_analysis.py \
${demo1} ${demo2} \
--subset-size 50 100 300 500 1000 \
--n-analyses 10
Calculate the correlations between average matrices of already-generated subsets in the ./subsets/ folder, then save the correlations and a visualization of them to the ./correls/ folder:
python3 automated_subset_analysis.py \
${demo1} ${demo2} \
--skip-subset-generation ./subsets/ \
--output ./correls/
The script will save each demographically-matched subset pair into a .csv file in the --output directory. Each one will be named subset_{x}_with_{y}_subjects.csv, where x ranges from 1 to the --n-analyses value and y is every value in the --subset-size list.
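To make the naming pattern concrete, the sketch below lists the file names that pattern implies for a given set of arguments. The function name subset_file_names is hypothetical and only mirrors the pattern stated above; it is not taken from the script.

```python
def subset_file_names(subset_sizes, n_analyses):
    """List the subset file names implied by the documented naming pattern."""
    return [
        f"subset_{x}_with_{y}_subjects.csv"
        for y in subset_sizes
        for x in range(1, n_analyses + 1)
    ]


names = subset_file_names([50, 100, 300, 500, 1000], 10)
print(len(names))  # 50
print(names[0])    # subset_1_with_50_subjects.csv
```

Five subset sizes with --n-analyses 10 gives 50 files, matching the first usage example above.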
Two .csv files, each with demographic data about subjects from a group, are given by the user. One subset is randomly selected from each group repeatedly. The number of subjects in each subset depends on --subset-size, and the number of times a subset of that size is selected depends on --n-analyses.
Once a pair has been selected, the script calculates the Euclidean distance between the average demographics of each subset and the average demographics of the whole other group. If the Euclidean distance between the subset and the total is higher than the estimated maximum value for significance (given by the equation calculated by ./src/euclidean_threshold_estimator.py; see footnote 1), then another subset is randomly generated and tested for significance. Otherwise, the subset pair is deemed valid and saved to a .csv file. The .csv has one subset per column and one subject ID per row, excluding the header row, which only contains the group number of each subset.
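The accept-or-resample step described above can be sketched as follows. This is a simplified illustration and not the script's implementation: it handles one group at a time, skips family matching, and takes max_distance as a plain argument, whereas the script estimates that threshold from subset size. All function names here are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Average each demographic column across a list of numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_matched_subset(group, other_group, subset_size, max_distance,
                        rng, max_tries=10000):
    """Resample until a subset's average is close enough to the other group."""
    target = mean_vector(other_group)
    for _ in range(max_tries):
        subset = rng.sample(group, subset_size)
        if euclidean_distance(mean_vector(subset), target) <= max_distance:
            return subset
    raise RuntimeError("No matched subset found within max_tries attempts")
```

Passing a seeded random.Random instance as rng makes the selection reproducible, which helps when debugging a threshold that rejects too many subsets.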
After finding a valid pair of subsets, the script calculates the correlation between the subset of group 1 and the subset of group 2. Each correlation value is stored with the number of subjects in the subsets it describes. Once all of the correlation values are calculated, they are saved to a .csv file with a name starting with correlations. That .csv has two columns and one row per subset pair, excluding the header row, which contains only the names of the columns: Subjects and Correlation.
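The correlation step and the two-column output format can be sketched with the standard library alone. This is an illustration under assumptions, not the script's code: it assumes the correlation is a Pearson correlation over flattened matrix values, and the names pearson and write_correlations are hypothetical.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def write_correlations(out_file, results):
    """Write (subjects, correlation) pairs under the documented header."""
    writer = csv.writer(out_file)
    writer.writerow(["Subjects", "Correlation"])
    writer.writerows(results)


# Hypothetical flattened average matrices for one subset pair
avg_subset_1 = [0.10, 0.40, 0.35, 0.80]
avg_subset_2 = [0.12, 0.38, 0.30, 0.85]
buffer = io.StringIO()
write_correlations(buffer, [(50, pearson(avg_subset_1, avg_subset_2))])
print(buffer.getvalue())
```

A file written this way has the same Subjects and Correlation columns that --only-make-graphs expects.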
Once the correlation .csv files are made, the script will make a graph visualization for each one. That graph will plot how the number of subjects in a subset pair (x-axis) relates to the correlation between the subsets in that pair (y-axis) (see footnote 2).
1 The equation currently used in automated_subset_analysis.py to predict the significant Euclidean distance threshold from subset size was found using this Bash code:
python3 ./src/euclidean_threshold_estimator.py \
./raw/ABCD_2.0_group1_data_10minpconns.csv \
./raw/ABCD_2.0_group2_data_10minpconns.csv \
-con-vars ./automated_subset_analysis_files/continuous_variables.csv \
--subset-size 1300 1200 1100 1000 900 800 700 600 500 400 300 200 100 90 80 70 60 50 \
--n-analyses 10
The data used to calculate that equation can be found in ./src/euclidean_threshold_estimate_data/est-eu-thresh-2019-12-12.
2 If --plot-with-matlab is not used, the output visualization will include:
- One trendline using the average correlation values of each subset size (or more if --only-make-graphs includes multiple parameters),
- A shaded region showing a data range, either the confidence interval or all data (if --fill is used),
- Each correlation value as 1 data point (if --plot includes scatter; if --rounded-scatter is used, only correlation values at specific intervals will be plotted),
- Standard deviation bars above and below each data point (if --plot includes stdev), and
- A legend to identify all of these parts (unless --hide-legend is used).
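The per-size statistics behind the trendline, standard deviation bars, and full data range can be sketched as below. This is a hypothetical helper for inspecting a correlations file by hand, not the plotting code itself, and it does not compute the 95% confidence interval.

```python
import statistics
from collections import defaultdict


def summarize(rows):
    """Group (subjects, correlation) pairs by subset size and summarize them."""
    by_size = defaultdict(list)
    for subjects, correlation in rows:
        by_size[subjects].append(correlation)
    return {
        size: {
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two values
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
        for size, values in sorted(by_size.items())
    }


# Hypothetical correlation values for illustration only
summary = summarize([(50, 0.80), (50, 0.90), (100, 0.95)])
for size, stats in summary.items():
    print(size, stats)
```

The mean per size corresponds to the trendline, the stdev to the bars drawn with --plot stdev, and the min and max to the region shaded by --fill all.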
Full error:
Not enough subjects in population to randomly select a sample with {X} subjects, because {Y} subjects cannot be randomly swapped out from a pool of {Z} subjects
or
ValueError: Cannot take a larger sample than population when 'replace=False'
Problem: At least one of the --subset-size values is too high.
Solutions:
- Reduce the largest --subset-size value to, at most, about 45% of the smallest ARM's size.
- Include the --no-matching flag to skip family matching.
Explanation:
- All --subset-size values must be large enough to demographically match the other ARM, but small enough to swap out any participants whose inclusion is invalid for any reason (e.g. they have family members outside the subset). For example, if the smallest ARM has 3000 subjects, then errors will occur unless you keep the --subset-size numbers under 1500. If errors still occur, try reducing the largest --subset-size further. The new subset size must be less than about 45% of the smallest ARM's size, excluding participants with NaNs in the demographic file. The number and percentage of participants with NaNs in each group are printed right after the script begins.
- By default, automated_subset_analysis.py checks that every subset (a) has the same proportion of twins/triplets as the other ARM, and (b) excludes anyone with family members outside the subset. The --no-matching flag turns both checks off. It lets you generate subsets of fewer than 25 subjects, or larger than half the ARM size. It also speeds up subset generation/checking.
Information about this README file:
- Created by Greg Conan, 2019-10-03
- Updated by Greg Conan, 2021-08-24