-
Notifications
You must be signed in to change notification settings - Fork 1
3. Analysis
Most of the analysis code is carried out with the Experiment
class contained in the experiment
module.
The general workflow is:
# get dataset from ingest module
experiment = Experiment(dataset)
normalised_data = experiment.get_normalised_data()
final_results = experiment.get_final_results()
failures = experiment.get_failures_as_dataframe()
model_parameters = experiment.get_model_parameters()
# pass dataframes from Experiment to AnalaysisDataBaseUploader
-
Experiment()
is initialised with one parameter which is a pandas DataFrame fromingest.read_data_from_list()
. - The experiment object creates and stores
Plate
andSample
objects from the dataframe. - It is largely responsible for collecting data calculated from plates and samples and returning them as dataframes.
- A plate object is a pseudo-96-well plate of a single dilution.
- It takes in a dataframe when initialised, this datafame is a subset of
ingest.read_data_from_list()
dataframe corresponding to a single mock 96 well barcode. - It has methods to perform normalisation such as subtracting background.
- This is were the
percentage_infected
feature is calculated. - It has quality-control checks that can flag the entire plate to reporters.
- A sample object holds the data for a sample across all 4 dilutions including 2 replicates.
- It has methods to fit a curve to the data.
- It has quality-control checks to flag the sample to reporters.
- The sample might be a positive control, and undergoes different quality control checks.
Quality control thresholds are defined in the qc_criteria
module. They are either single values such as low_cells_image_region_area_low = 0.6
for global thresholds, or defined in a dictionary if there are variant-specific thresholds.
An imaginary example of a variant-specific quality control check using default-dictionaries for the variant VariantXYZ
that would be defined in qc_criteria.py
.
from collections import defaultdict
positive_control_ic50 = defaultdict(
lambda: {"low": 350, "high": 650}
)
positive_control_ic50["VariantXYZ"] = {"low": 200, "high": 550}
- Plate failures are caused by either the control wells on a plate failing, or a significant number of sample wells failing for an individual well.
- Plate failures should be a
failure.PlateFailure()
class and added to thePlate.plate_failures
set, which is collected byExperiment
.
- Well failures should be a
failure.WellFailure()
class, and added to theSample.failures
set.
The percentage infected feature is used in the IC50 calculation, and is based on plaque area of the sample compared to the plaque area of the virus-only wells.
This is carried out in the Plate
object, and so is done per dilution (pseudo-96 well plate).
The plaque area measurement is converted into Background Subtracted Plaque Area
:
- Calculate the median of
Normalised Plaque area
for the no-virus wells (background
) -
Background Subtracted Plaque Area
=Normalised Plaque area
-background
- Calculate the median of
Background Subtracted Plaque Area
for virus-only well (infection
). -
Percentage Infected
=Background Subtracted Plaque Area
/infection
* 100
For each sample, a dilution - %infected curve is attempted to be fit. Most of the code for this is in stats.py
.
-
The curve fitted is a 4 parameter curve using the hill equation.
-
The curve is fitted on 1/dilution values. So a 1:40 dilution is 1/4 = 0.025 on the x-axis.
-
There are a number of heuristic checks carried out before curve fitting that may classify the same as
no inhibition
,complete inhibition
orweak inhibition
- If no heuristic is found then the curve is fit using
scipy.optimize.curve_fit()
. - If a curve is fit, then the curve parameters are stored in the database table
NE_model_parameters
.
- If no heuristic is found then the curve is fit using
-
The IC50 values is calculated where the curve passes the 50% infected line, not using the EC50 parameter calculated from curve fitting.
-
Reported IC50 values should not fall outside of the tested dilutions (1:40 -> 1:2560), if they do they should be re-labelled as either complete inhbition or weak inhibition.
-
Curves parameters are subject to constraints, such as not going beyond 120% infected.
-
If a curve cannot be fit, or a hampel-outlier test determines the curve is too strange then it is labelled as
model fit failure
. -
The mean-squared-error (MSE) between the dilution points and the fitted curve is calculated and a too-high MSE will cause a well-failure.
top
and bottom
parameters are the wrong way round, this is because the database columns were incorrectly renamed from a
, b
, c
, d
. So they are also the wrong way round in the code to mirror the database. If these are corrected then you would also have to correct the dash-apps etc which rely on the incorrect naming.
This heuristic is if the 2 most dilute values (1:2560, 1:640) are below the 50% threshold (or if the most dilute values are missing, the next 2 most dilute).
This is if the least dilute (1:40) value averages >50% infected but <60% infected. (or if the least dilute values are missing, the next least dilute (1:160).
If values across all dilutions are above the weak threshold of 60% infected.