Skip to content

Step by step Guide to data assessment, cleaning and analysis using a hypothetical phase II clinical trial dataset in Python

Notifications You must be signed in to change notification settings

UChisom/auralin_clinical_trial_data_analysis

Repository files navigation

Auralin Clinical Trial Data Analysis

Cleaning and analysis of dirty clinical trial (phase II) data testing the efficacy of a new drug (Auralin).

Outline

  1. Project Dataset
  2. Project Description
  3. Table Descriptions
  4. Additional useful information
  5. Reference

Project Dataset

The dataset has three tables containing the results for a clinical trial on a new drug. It was curated by Udacity staff alongside medical professionals so it mimics the originality and dirtiness common with medical data.

700 patients passed through the trial phase and their results are split into three tables namely the patients table containing patient information, the treatments table comparing results of each drug administered during the trial phase, and the adverse reactions table containing observed side effects from each drug on affected patients.

Project Description

  • This analysis was carried out using the Python language in the Jupyter notebook.

  • It serves as an easy guide through assessing, cleaning, and analyzing data.

  • The data is both untidy (structural issues) and dirty(quality issues), but this will be a simple guide to clean up the data for a brief analysis.

  • The Auralin clinical trial dataset contains medical data from a phase clinical trial on a new drug Auralin.

  • Auralin is planned to be a replacement for the currently injected drug for diabetic patients. Diabetic patients are presently administered a drug called Novodra through a syringe multiple times a day. This is not only uncomfortable and stressful but also has some physical side effects. Some of its side effects include itching around the region and sometimes swelling.

  • On the other hand, Auralin is targeted to be administered orally and prevent the hassle and discomfort of taking Novodra. To deploy the drug however, it must successfully pass the clinical trial stage. This dataset contains the phase II clinical trial results.

Table Descriptions

To familiarize with the terms and uses of columns in the dataset, here's a reference guide. You'll also find this information within the notebook to look back on.

patients: Table description

  • patient_id: the unique identifier for each patient in the Master Patient Index (i.e. patient database) of the pharmaceutical company that is producing Auralin
  • assigned_sex: the assigned sex of each patient at birth (male or female)
  • given_name: the given name (i.e. first name) of each patient
  • surname: the surname (i.e. last name) of each patient
  • address: the main address for each patient
  • city: the corresponding city for the main address of each patient
  • state: the corresponding state for the main address of each patient
  • zip_code: the corresponding zip code for the main address of each patient
  • country: the corresponding country for the main address of each patient (all United states for this clinical trial)
  • contact: phone number and email information for each patient
  • birthdate: the date of birth of each patient (month/day/year). The inclusion criteria for this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing problem among the elderly population)
  • weight: the weight of each patient in pounds (lbs)
  • height: the height of each patient in inches (in)
  • bmi: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m2 where kg is a person's weight in kilograms and m2 is their height in meters squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. The inclusion criteria for this clinical trial is 16 >= BMI >= 38.

treatments Table description

350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn’t enough time to capture all the change in HbA1c that can be attributed by the switch to Auralin or Novodra:

  • 175 patients switched to Auralin for 24 weeks
  • 175 patients continued using Novodra for 24 weeks

treatments columns:

  • given_name: the given name of each patient in the Master Patient Index that took part in the clinical trial
  • surname: the surname of each patient in the Master Patient Index that took part in the clinical trial
  • auralin: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) and the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash). Both are measured in units (short form 'u'), which is the international unit of measurement and the standard measurement for insulin.
  • novodra: same as above, except for patients that continued treatment with Novodra
  • hba1c_start: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The HbA1c test measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
  • hba1c_end: the patient's HbA1c level at the end of the last week of treatment
  • hba1c_change: the change in the patient's HbA1c level from the start of treatment to the end, i.e., hba1c_start - hba1c_end. For Auralin to be deemed effective, it must be "non-inferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).

adverse_reactions: Table description

adverse_reactions columns:

  • given_name: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated Auralin and Novodra)
  • surname: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated Auralin and Novodra)
  • adverse_reaction: the adverse reaction reported by the patient

Additional useful information:

  • Insulin resistance varies person to person, which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
  • It is important to test drugs and medical products in the people they are meant to help. People of different ages, races, sex, and ethnic group must be included in clinical trials. This diversity is reflected in the patients table.
  • Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally, you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a similar debate exists for variable names). The auralin and novodra column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

Reference

The Data was curated by Udacity and some medical professionals to ensure resemlance to reall life clinical trial datasets. I obtained it from the Data Analyst nanodegree program on Udacity

About

Step by step Guide to data assessment, cleaning and analysis using a hypothetical phase II clinical trial dataset in Python

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published