Skip to content

3. Analysis

Scott Warchal edited this page Feb 17, 2022 · 16 revisions

Most of the analysis code is carried out with the Experiment class contained in the experiment module.

The general workflow is:

# get dataset from ingest module

experiment = Experiment(dataset)

normalised_data = experiment.get_normalised_data()
final_results = experiment.get_final_results()
failures = experiment.get_failures_as_dataframe()
model_parameters = experiment.get_model_parameters()

# pass dataframes from Experiment to AnalaysisDataBaseUploader

Experiment

  • Experiment() is initialised with one parameter which is a pandas DataFrame from ingest.read_data_from_list().
  • The experiment object creates and stores Plate and Sample objects from the dataframe.
  • It is largely responsible for collecting data calculated from plates and samples and returning them as dataframes.

Plate

  • A plate object is a pseudo-96-well plate of a single dilution.
  • It takes in a dataframe when initialised, this datafame is a subset of ingest.read_data_from_list() dataframe corresponding to a single mock 96 well barcode.
  • It has methods to perform normalisation such as subtracting background.
  • This is were the percentage_infected feature is calculated.
  • It has quality-control checks that can flag the entire plate to reporters.

Sample

  • A sample object holds the data for a sample across all 4 dilutions including 2 replicates.
  • It has methods to fit a curve to the data.
  • It has quality-control checks to flag the sample to reporters.
  • The sample might be a positive control, and undergoes different quality control checks.

Quality control

Quality control thresholds are defined in the qc_criteria module. They are either single values such as low_cells_image_region_area_low = 0.6 for global thresholds, or defined in a dictionary if there are variant-specific thresholds.

Variant specific QC checks

An imaginary example of a variant-specific quality control check using default-dictionaries for the variant VariantXYZ that would be defined in qc_criteria.py.

from collections import defaultdict

positive_control_ic50 = defaultdict(
    lambda: {"low": 350, "high": 650}
)
positive_control_ic50["VariantXYZ"] = {"low": 200, "high": 550}

Plate failures

  • Plate failures are caused by either the control wells on a plate failing, or a significant number of sample wells failing for an individual well.
  • Plate failures should be a failure.PlateFailure() class and added to the Plate.plate_failures set, which is collected by Experiment.

Well failures

  • Well failures should be a failure.WellFailure() class, and added to the Sample.failures set.

Percentage infected calculation

The percentage infected feature is used in the IC50 calculation, and is based on plaque area of the sample compared to the plaque area of the virus-only wells. This is carried out in the Plate object, and so is done per dilution (pseudo-96 well plate).

Background subtraction

The plaque area measurement is converted into Background Subtracted Plaque Area:

  1. Calculate the median of Normalised Plaque area for the no-virus wells (background)
  2. Background Subtracted Plaque Area = Normalised Plaque area - background

Percentage infected

  1. Calculate the median of Background Subtracted Plaque Area for virus-only well (infection).
  2. Percentage Infected = Background Subtracted Plaque Area / infection * 100

Curve fitting

For each sample, a dilution - %infected curve is attempted to be fit. Most of the code for this is in stats.py.

  • The curve fitted is a 4 parameter curve using the hill equation.

  • The curve is fitted on 1/dilution values. So a 1:40 dilution is 1/4 = 0.025 on the x-axis.

  • There are a number of heuristic checks carried out before curve fitting that may classify the same as no inhibition, complete inhibition or weak inhibition

    • If no heuristic is found then the curve is fit using scipy.optimize.curve_fit().
    • If a curve is fit, then the curve parameters are stored in the database table NE_model_parameters.
  • The IC50 values is calculated where the curve passes the 50% infected line, not using the EC50 parameter calculated from curve fitting.

  • Reported IC50 values should not fall outside of the tested dilutions (1:40 -> 1:2560), if they do they should be re-labelled as either complete inhbition or weak inhibition.

  • Curves parameters are subject to constraints, such as not going beyond 120% infected.

  • If a curve cannot be fit, or a hampel-outlier test determines the curve is too strange then it is labelled as model fit failure.

  • The mean-squared-error (MSE) between the dilution points and the fitted curve is calculated and a too-high MSE will cause a well-failure.

⚠️ The top and bottom parameters are the wrong way round, this is because the database columns were incorrectly renamed from a, b, c, d. So they are also the wrong way round in the code to mirror the database. If these are corrected then you would also have to correct the dash-apps etc which rely on the incorrect naming.

Complete inhibition

This heuristic is if the 2 most dilute values (1:2560, 1:640) are below the 50% threshold (or if the most dilute values are missing, the next 2 most dilute).

Weak inhibition

This is if the least dilute (1:40) value averages >50% infected but <60% infected. (or if the least dilute values are missing, the next least dilute (1:160).

No inhibition

If values across all dilutions are above the weak threshold of 60% infected.