In this assignment, you'll implement a classifier using logistic regression, optimized with gradient descent.
In class, we went over an implementation of linear regression using gradient descent. For this homework, you will be implementing a logistic regression model using the same framework. Logistic regression is useful for binary classification because the sigmoid function outputs a value between 0 and 1.
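As a quick illustration (separate from the assignment code itself), the sigmoid squashes any real-valued score into the interval (0, 1), which is what lets us interpret the model's output as a class probability:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score (or array of scores) to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly at the decision boundary:
print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # values near 0 and 1 at the extremes
```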
In this repository, you are given a set of simulated medical record data from patients with small cell and non-small cell lung cancers. Your goal is to apply a logistic regression classifier to this dataset, predicting whether a patient has small cell or non-small cell lung cancer based on features of their medical record prior to diagnosis.
As stated above, logistic regression involves using a sigmoid function to model the data. Just like in linear regression, we will define a loss function to keep track of how well the model performs. But instead of mean-squared error, you will implement the binary cross entropy loss function. This function minimizes the error when the predicted y is close to an expected value of 1 or 0. Here are some resources to get you started: [1], [2], [3].
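For reference, one common vectorized form of binary cross entropy looks like the sketch below (the function name and signature here are illustrative and won't necessarily match the starter code's `loss_function` method):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross entropy loss.

    y_true: array of 0/1 labels; y_pred: predicted probabilities in (0, 1).
    eps clips predictions away from exactly 0 or 1 to avoid log(0).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y = np.array([1, 0, 1])
# Confident, correct predictions give a small loss...
print(binary_cross_entropy(y, np.array([0.99, 0.01, 0.95])))
# ...while confident, wrong predictions are penalized heavily.
print(binary_cross_entropy(y, np.array([0.01, 0.99, 0.05])))
```

Note the `eps` clipping: without it, a prediction of exactly 0 or 1 on a mismatched label sends the loss to infinity.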
You will find the full dataset in `data/nsclc.csv`. Class labels are encoded in the `NSCLC` column of the dataset, with 1 = NSCLC and 0 = small cell. A set of features has been pre-selected for you to use in your model during testing (see `main.py`), but you are encouraged to submit unit tests that look at different features. The full list of features can be found in `logreg/utils.py`.
- [TODO] Complete the logistic regression implementation. (5 points)
  - complete the `make_prediction` method
  - complete the `loss_function` method
  - complete the `calculate_gradient` method
  - readable code with clear comments and method descriptions
- [TODO] Write appropriate unit tests for each implemented function and for the overall training procedure. See `test/test_logreg.py` for some suggested tests. (3 points)
- [TODO] Package as a module using `pyproject.toml` and set up GitHub Actions to install your module and run your unit tests. Add a status badge to this README. (2 points)
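For the packaging TODO, a minimal setup might look like the following sketches (the package name, dependency list, Python version, and workflow filename are placeholders; adapt them to your repository). A `pyproject.toml`:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "logreg"
version = "0.1.0"
dependencies = ["numpy", "pandas"]
```

and a workflow such as `.github/workflows/test.yml` that installs the module and runs the tests on every push:

```yaml
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install . pytest
      - run: pytest
```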
Fork this repository to your own GitHub account. Work on the codebase locally and commit changes to your forked repository.
You will need the following packages:
Try tuning the hyperparameters if you find that your model doesn't converge. Too high of a learning rate or too large of a batch size can sometimes cause the model to be unstable (e.g. loss function goes to infinity). If you're interested, scikit-learn also has some built-in toy datasets that you can use for testing.
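To see the kind of instability described above, here is a toy example (plain NumPy, independent of the assignment code) of gradient descent on the quadratic loss `f(w) = w**2`, where a learning rate that is too large makes the iterates diverge instead of converging to the minimum at 0:

```python
import numpy as np

def descend(lr, steps=20, w0=1.0):
    """Run gradient descent on f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(lr=0.1))   # shrinks toward the minimum at 0
print(descend(lr=1.5))   # |w| doubles each step: the run blows up
```

For this loss each update multiplies `w` by `(1 - 2 * lr)`, so any `lr > 1` makes the magnitude of that factor exceed 1 and the loss grows without bound, which is the same "loss goes to infinity" failure mode you may see with a badly tuned learning rate.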
We're applying a pretty simple model to a relatively complex problem here, so you should expect your classifier to perform decently but not amazingly. It's also possible for a given optimization run to get stuck in a local minimum depending on the initialization. With that said, if your implementation is correct and you find reasonable hyperparameters, you should almost always do at least better than chance.