To run this script, you will need:
- The necessary libraries installed
- A .env file with the required environment variables
- The input data file
The script works with the following versions:
- python 3.11.5
- pandas 2.1.0
- numpy 1.25.2
- matplotlib 3.7.2
- python-dotenv 1.0.0
- scikit-learn 1.3.0
- Jinja2 3.1.2
- xlrd 2.0.1
To install a library, run the following in a terminal (python-dotenv shown as an example):

```shell
pip install python-dotenv
```

To view the installed version of a library, run the following:

```shell
pip show python-dotenv
```

Alternatively, to view all of the installed libraries:

```shell
pip list
```

An easier and quicker way is to install all required packages at once:

```shell
pip install -r requirements.txt
```
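For reference, a requirements.txt pinning the versions listed above might look like this (Python itself is not pip-installable and is omitted; the exact file shipped with the repository may differ):

```
pandas==2.1.0
numpy==1.25.2
matplotlib==3.7.2
python-dotenv==1.0.0
scikit-learn==1.3.0
Jinja2==3.1.2
xlrd==2.0.1
```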
The .env file needs the following environment variables for the script to work properly:
- CTG: path to the dataset with medical information on fetal heart rate experiment results.
- CTG_sheet: name of the sheet that holds the main data.

Your .env file should look like this:

```
CTG = 'CTG.xls'
CTG_sheet = 'Data'
```
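As a sketch of what happens with these variables, the following pure-stdlib function is a minimal stand-in for what python-dotenv's `load_dotenv()` does: it reads `KEY = 'value'` pairs from a .env file into `os.environ`. (In the actual script, `from dotenv import load_dotenv; load_dotenv()` would be used; `load_env_file` here is a hypothetical helper for illustration.)

```python
import os

def load_env_file(path=".env"):
    # Read each KEY = value line from the .env file into the process environment,
    # skipping blanks and comments, and stripping surrounding quotes from values.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip("'\"")

# After loading, the script reads the variables like this:
# ctg_path = os.getenv("CTG")          # e.g. 'CTG.xls'
# ctg_sheet = os.getenv("CTG_sheet")   # e.g. 'Data'
```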
The input data comes from the UC Irvine Machine Learning Repository, where the dataset is named "Cardiotocography". This dataset contains three tabs; the ones we're interested in are "Description" and "Data". "Description" explains each of the columns used in "Data", while the latter holds all of the experiment results we need. Here is the reference to obtain it:
CTG: Campos, D. and Bernardes, J. (2010). Cardiotocography. UCI Machine Learning Repository. https://doi.org/10.24432/C51S4N

For more information, consult the paper written on this dataset: Diogo Ayres-de-Campos, João Bernardes, Antonio Garrido, Joaquim Marques-de-Sá & Luis Pereira-Leite (2000). SisPorto 2.0: A Program for Automated Analysis of Cardiotocograms. Journal of Maternal-Fetal Medicine, 9:5, 311-318. DOI: 10.3109/14767050009053454
- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/Fetal-Heart-Rate.git
  cd Fetal-Heart-Rate
  ```

- Create a virtual environment:

  ```shell
  python -m venv venv
  ```

- Activate the virtual environment:
  - On Windows:

    ```shell
    venv\Scripts\activate
    ```

  - On macOS/Linux:

    ```shell
    source venv/bin/activate
    ```

- Install the required packages:

  ```shell
  pip install -r requirements.txt
  ```
The following variables from CTG.xls were selected for the DataFrame 'data', with these name changes:
- LB : bl_FHR
- AC.1 : accel
- FM.1 : fetal_mov
- UC.1 : uterine_contr
- DL.1 : light_decel
- DS.1 : severe_decel
- DP.1 : prolong_decel
- ASTV
- MSTV
- ALTV
- MLTV
- Width
- Min
- Max
- Nmax
- Nzeros
- Mode
- Mean
- Median
- Variance
- Tendency
- A : calm_sleep
- B : rem_sleep
- C : calm_vig
- D : active_vig
- E (no explanation from source)
- SH : sh_pattern
- AD : ad_pattern
- DE : de_pattern
- LD : ld_pattern
- FS : fs_pattern
- SUSP : sus_pattern
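The renaming above can be applied with pandas' `rename`. The snippet below is a minimal sketch using a hypothetical mini-frame in place of the real sheet, and shows only part of the mapping; the full mapping follows the list above.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame read from CTG.xls
data = pd.DataFrame({"LB": [120, 132], "AC.1": [0.0, 0.006], "UC.1": [0.0, 0.008]})

# Map the raw CTG column names to the readable names used in the analysis
rename_map = {
    "LB": "bl_FHR",
    "AC.1": "accel",
    "FM.1": "fetal_mov",
    "UC.1": "uterine_contr",
    "DL.1": "light_decel",
    "DS.1": "severe_decel",
    "DP.1": "prolong_decel",
}
# rename() silently ignores keys not present in this mini-frame
data = data.rename(columns=rename_map)
print(list(data.columns))  # ['bl_FHR', 'accel', 'uterine_contr']
```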
Null values were dropped (3 rows removed).

Outlier ranges were defined for variables showing clear gaps in their histograms or extremely long tails. Variables from Tendency to sus_pattern are categorical, so they were dropped for the correlation matrix and the EDA against the target variable.
Comparing the options of dropping versus capping outliers showed that dropping was the better choice: capping would change the values of the affected columns, distorting their relationships with columns that had no outliers.
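As a minimal sketch of the dropping approach, rows outside the defined ranges can be filtered out with a boolean mask. The bounds and toy data below are hypothetical; the real ranges were read off the project's histograms.

```python
import pandas as pd

# Hypothetical outlier bounds per column (illustrative values only)
outlier_ranges = {"bl_FHR": (100, 180), "fetal_mov": (0, 400)}

df = pd.DataFrame({"bl_FHR": [120, 135, 250], "fetal_mov": [0, 600, 10]})

# Keep only rows whose values fall inside every defined range (drop, don't cap)
mask = pd.Series(True, index=df.index)
for col, (lo, hi) in outlier_ranges.items():
    mask &= df[col].between(lo, hi)
df = df[mask]
print(len(df))  # 1 row survives: the other two fall outside a range
```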
DataFrames with normalized and standardized values were created for comparison at the modeling stage.
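For illustration, both transformations can be written directly with pandas arithmetic (the project may equally have used scikit-learn's MinMaxScaler and StandardScaler; the single-column frame below is a toy example):

```python
import pandas as pd

df = pd.DataFrame({"bl_FHR": [120.0, 130.0, 140.0]})

# Min-max normalization: rescale each column to [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean, unit (sample) standard deviation per column
standardized = (df - df.mean()) / df.std()

print(normalized["bl_FHR"].tolist())    # [0.0, 0.5, 1.0]
print(standardized["bl_FHR"].tolist())  # [-1.0, 0.0, 1.0]
```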
For the model, a Random Forest Classifier was used, since this is a classification problem.

Three sets of results were produced, one per feature engineering technique, so they could be compared and analyzed.
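The comparison loop might look like the sketch below, which trains one Random Forest per feature variant. The data here is synthetic (a stand-in for the CTG features), so the scores are not the project's results. Note that tree-based models are largely insensitive to feature scaling, so the three variants are expected to score similarly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))             # synthetic stand-in for the CTG features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary target

# One feature matrix per feature-engineering variant
variants = {
    "raw": X,
    "normalized": (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)),
    "standardized": (X - X.mean(axis=0)) / X.std(axis=0),
}

# Train and score a Random Forest on each variant
scores = {}
for name, features in variants.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
print(scores)
```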
- Import Libraries:

  ```python
  import os
  import pandas as pd
  import numpy as np
  from matplotlib import pyplot as plt
  import seaborn as sns
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import classification_report, confusion_matrix
  ```
- Set Environment Variables: Ensure that the environment variables CTG and CTG_sheet are set correctly. These variables should point to the Excel file and the specific sheet containing the fetal heart rate data.
- Validate Environment Variables: Check if the environment variables are set and raise an error if they are not:

  ```python
  CTG_FILE_PATH = os.getenv('CTG')
  CTG_SHEET_NAME = os.getenv('CTG_sheet')
  if not CTG_FILE_PATH:
      raise ValueError("Environment variable 'CTG' is not set.")
  if not CTG_SHEET_NAME:
      raise ValueError("Environment variable 'CTG_sheet' is not set.")
  ```
- Check File Existence: Verify that the file exists at the specified path:

  ```python
  if not os.path.exists(CTG_FILE_PATH):
      raise FileNotFoundError(f"The file at {CTG_FILE_PATH} does not exist.")
  ```
- Load Data: Load the data from the specified Excel file and sheet into a pandas DataFrame:

  ```python
  data = pd.read_excel(CTG_FILE_PATH, sheet_name=CTG_SHEET_NAME, skiprows=1)
  ```
- Data Preprocessing: Perform necessary data preprocessing steps such as handling missing values, encoding categorical variables, and scaling numerical features:

  ```python
  # Example preprocessing steps
  data.dropna(inplace=True)
  X = data.drop('target_column', axis=1)
  y = data['target_column']
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)
  ```
- Train-Test Split: Split the data into training and testing sets:

  ```python
  X_train, X_test, y_train, y_test = train_test_split(
      X_scaled, y, test_size=0.2, random_state=42
  )
  ```
- Model Training: Train a machine learning model (e.g., Logistic Regression) on the training data:

  ```python
  model = LogisticRegression()
  model.fit(X_train, y_train)
  ```
- Model Evaluation: Evaluate the model on the testing data and print the classification report and confusion matrix:

  ```python
  y_pred = model.predict(X_test)
  print(classification_report(y_test, y_pred))
  print(confusion_matrix(y_test, y_pred))
  ```
- Data Visualization: Use matplotlib and seaborn to visualize the data and the results:

  ```python
  sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
  plt.show()
  ```
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.