Add documentation on the modelling stage, and change model used
Bear-Witness-98 committed Jul 19, 2024
1 parent a68d6d4 commit f9ad8cf
Showing 2 changed files with 111 additions and 33 deletions.
8 changes: 4 additions & 4 deletions challenge/model.py
@@ -5,7 +5,6 @@
 
 import numpy as np
 import pandas as pd
-from sklearn.linear_model import LogisticRegression
 from xgboost import XGBClassifier
 
 
@@ -26,7 +25,7 @@ class DelayModel:
     THRESHOLD_IN_MINUTES = 15
 
     def __init__(self):
-        self._model = LogisticRegression()
+        self._model = XGBClassifier()
 
     def _get_min_diff(self, data: pd.Series) -> float:
         """
@@ -133,10 +132,11 @@ def fit(self, features: pd.DataFrame, target: pd.DataFrame) -> None:
         # get values to compensate unbalancing
         n_y0 = len(target[target[target_column] == 0])
         n_y1 = len(target[target[target_column] == 1])
+        scale = n_y0 / n_y1
 
         # instantiate model and fit
-        self._model = LogisticRegression(
-            class_weight={1: n_y0 / len(target), 0: n_y1 / len(target)}
+        self._model = XGBClassifier(
+            random_state=1, learning_rate=0.01, scale_pos_weight=scale
         )
         self._model.fit(features, target[target_column])
136 changes: 107 additions & 29 deletions docs/challenge.md
@@ -1,24 +1,89 @@
# Challenge Notes

-## Reviewing the notebook
+Notes during the development of the challenge.
 
-### Generalities
+## Part I: Model selection and transcription

-In the documentation (README) it is said that the data has a column named
-DATA-I when talking about the additional DS features, but this column does not
-exist. Instead, from the code and the description, we can assume this column
-name should be FECHA-I.
+### Reviewing the DS's notebook

-### Feedback to the DS
+There are quite a few issues with how the notebook was presented. The first is
+that it does not run properly, due to the missing `x=` and `y=` kwargs in the
+barplots. A second issue is that seaborn's `.set` method was deprecated in
+favour of `.set_theme`; also, this method should only be called once. Lastly,
+some markdown subtitles in the First Sight data analysis did not correspond to
+the cells below them. While neither critical nor relevant to the result, these
+issues do not give a good impression of the care taken while summarising the
+DS's experiments into this notebook. They were fixed so the notebook executes
+properly (along with the change from `../data/data.csv` to `data/data.csv`).

-I would like to ask a few things of the DS, as it seems unclear why he made
-some of the selections he did. I am supposing this is a kind of summary
-notebook, so some details may have been missed in condensing his/her analysis.
+#### Data analysis

The colour lightblue is a poor choice: it gives almost no contrast between the
graph's background and the bars.

The "flight by destination" graph is so cramped that it is hard to tell which
bar corresponds to which destination.

#### Features generation

The additional features the DS computed make sense at first sight, but the
target does not match what was asked. If the idea was to predict the
**probability** of a flight being delayed, then a regressor should be trained,
and the information on *how much* delay the flight had could be incorporated
into that encoding. This may belong more to the modelling part than to feature
generation, but the two are tightly coupled. Also, the 15-minute threshold for
considering a flight delayed is not appropriately justified.
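If a probability is indeed what stakeholders want, a binary classifier can
still provide it through `predict_proba`. A minimal sketch with synthetic data
(the feature and target below are made up for illustration, not the challenge
dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real data: one numeric feature and a binary
# "delayed more than 15 minutes" target (both entirely synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

clf = LogisticRegression().fit(X, y)

labels = clf.predict(X[:5])              # hard 0/1 decisions
probas = clf.predict_proba(X[:5])[:, 1]  # P(delay) per flight, in [0, 1]
print(labels, probas)
```

The same trained model answers both questions; only the reporting changes.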

#### Data analysis II

The first cell of this part shows bad coding practices.

#### Training

`training_data` was defined in the second cell but never used again.

The selection of three features is done with no explanation whatsoever. A
binary encoding of `TIPOVUELO` may benefit the model by reducing the input
dimension. A cyclical encoding of the months could benefit the model both by
reducing the input dimension and by exposing the cyclical nature of the months
to the model.
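A cyclical encoding of the month can be sketched as follows
(`encode_month_cyclical` is a hypothetical helper, not code from the
repository):

```python
import numpy as np
import pandas as pd

def encode_month_cyclical(months: pd.Series) -> pd.DataFrame:
    """Project month 1..12 onto the unit circle, so December and January
    end up adjacent instead of 11 units apart (2 columns vs 12 dummies)."""
    angle = 2 * np.pi * (months - 1) / 12
    return pd.DataFrame({"month_sin": np.sin(angle), "month_cos": np.cos(angle)})

enc = encode_month_cyclical(pd.Series([1, 6, 12]))
# December's encoding ends up much closer to January's than June's is.
dist = lambda a, b: float(np.hypot(*(enc.iloc[a] - enc.iloc[b])))
print(dist(2, 0), dist(1, 0))
```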

On the xgboost training, the DS added an artificial threshold to its outputs;
this was not needed, as the model already outputs 0/1. It should also be a red
flag that the model always predicted 0, meaning it was not capturing the
information in the dataset. This result could be used as an argument for the
imbalance of the dataset (in addition to the Logistic Regression below). By no
means should this trained model be used to get the most important features, as
it would pick the features that best predict the 0 class.
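The "always predicts 0" red flag is easy to reproduce: on an imbalanced
target, a model that learns nothing still scores well on accuracy. A sketch
with synthetic data (the ~18% positive rate is an assumption for illustration,
not a figure from the dataset):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.18).astype(int)  # synthetic imbalanced target

# Always predict the majority class -- equivalent to the "all zeros" model.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)  # looks good, despite learning nothing
rec = recall_score(y, pred)    # 0.0: every delayed flight is missed
print(acc, rec)
```

This is why accuracy alone cannot justify the model, and why recall on the
delayed class matters here.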

Notice that the top 10 features selected do not match the top 10 features in
the importance graph (at least in this run). No mention is made of this
anywhere; such a striking result deserves a comment.

#### Training II

For a second round of training, the DS considered the most important features
to be more relevant than the balancing and designed the experiments
accordingly, getting the expected results when balancing is not applied. The
results without balancing lose their meaning in this context.

With the balancing, though, somewhat better results are obtained. But there is
no discussion of whether this is enough for the business. I do not think that
needs to be explicit in the DS's analysis, but it would be nice to have as a
conclusion of the work.
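The balancing scheme from the notebook (weights inversely proportional to
class frequency) can be sketched as below; the data is synthetic, but the
weight formulas mirror the ones used in the model's `fit`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 0.15).astype(int)  # synthetic imbalanced target

n_y0, n_y1 = int((y == 0).sum()), int((y == 1).sum())

# Each class is weighted by the other class's frequency, so the rare
# positive class counts for more during training.
weights = {1: n_y0 / len(y), 0: n_y1 / len(y)}
balanced = LogisticRegression(class_weight=weights).fit(X, y)

# xgboost expresses the same correction as a single ratio:
scale_pos_weight = n_y0 / n_y1
print(weights, scale_pos_weight)
```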

#### Conclusions and next steps:

I would send an email back to the DS asking for more explanation of the
results and their interpretation, pointing out some of the mistakes I have
found. Something along these lines:

1. I see that at the data splitting step, you decided to keep just three
features. Why keep these features specifically? It seems to me that other
features, such as the destination, may encode some information on the delay
process.
2. On the target encoding, I understood that we wanted to predict the
probability of a delay on a specific flight. Your encoding just predicts whether
the flight would be delayed more than 15 minutes or not. Have you come up with
@@ -34,30 +99,43 @@ this decision.
4. The 10 features you selected to train with are not the top 10 I am seeing
in the graph of feature importance from the xgboost. I think it might be due
to some random seed issue. Would you care to go over it?
5. Are these metrics enough from a business perspective? Or are they expected
to be improved in further iterations?

I also noted some comments on the code, but I would not bother the DS with
them, as code style was not the main focus of the work and I have not found
clear bugs in the prediction features, which are what I will use later. I
would, though, report any bug in the computation of the features if I found
one.

##### Model selection

At the first iteration of the challenge, I chose LogisticRegression as the
model, for its simplicity, its ease of explainability (which features it gives
more weight to), and because it is part of scikit-learn, our training
framework. But given that there are a lot of unsolved mysteries and we may
need more predictive power, I will now choose the xgboost model, to surface
any additional issues early and for ease of retraining, with the added benefit
that we do not lose that much explainability.

##### Feature selection

I will use the features selected by the DS as-is, because they are the best
baseline we have and do not compromise time-to-delivery on a first iteration.
Keeping the list as a parameter and updating it when needed is
straightforward.
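Keeping the list as a parameter could look like this minimal sketch (the
column names are illustrative stand-ins for the DS's top-10 dummies, and
`select_features` is a hypothetical helper):

```python
import pandas as pd

# Illustrative subset of one-hot feature names; swap in the DS's actual list.
TOP_FEATURES = ["OPERA_Latin American Wings", "MES_7", "TIPOVUELO_I"]

def select_features(features: pd.DataFrame,
                    columns: list = TOP_FEATURES) -> pd.DataFrame:
    # reindex keeps the column order stable and fills absent dummies with 0,
    # so serving-time frames with unseen categories do not raise.
    return features.reindex(columns=columns, fill_value=0)

raw = pd.DataFrame({"MES_7": [1, 0], "TIPOVUELO_I": [0, 1], "MES_3": [0, 1]})
selected = select_features(raw)
print(list(selected.columns), selected.shape)
```

Updating the feature set later then means editing one constant, not the
pipeline.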

##### Code-wise

-### How to continue
+Implemented the functions as similarly to the notebook as possible, improving
+as much of the code as I could.

-I will proceed by moving the pipeline the DS implemented here as similarly as
-possible to the production pipeline, as these are the best predictions we
-have yet. Though, I will also try to make it as versatile as possible in the
-feature selection stage, which I think is the one that may need a revision
-from his side; though, without compromising the time-to-production of the
-system.
+Added code to be executed when the script is run with
+`python challenge/model.py`, to generate and save a trained model at the path
+`models/model.pkl`. Pickle is not the best format to use, for compatibility
+reasons, but as I just saved the trained classifier and use it inside the
+same class it was generated in, I do not expect errors. Converting it to a
+more general format (like ONNX) and then running it properly is not trivial.
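The save/load round trip described above can be sketched as follows (the
estimator and training data are placeholders; only the `models/model.pkl`
path comes from the text):

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder model: any fitted estimator pickles the same way.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

path = Path("models/model.pkl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("wb") as fh:
    pickle.dump(model, fh)

# The caveat: unpickling assumes the same library versions are installed.
with path.open("rb") as fh:
    restored = pickle.load(fh)
print(restored.predict([[2.5]]))
```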

-#### On the model selection
+The final model might be modified if another stage of experimentation is done
+by the DS but, as both logistic regression and xgboost support the same
+interface for prediction, it should not be much of an issue to change it
+afterwards if implemented appropriately.
 
-I will go with the logistic regression, as its execution time is determined
-only by the number of features, and not by a highly tuned hyperparameter
-(such as the number of trees in xgboost). Also, it has the advantage that we
-can limit ourselves to only one framework (scikit-learn), and have fewer
-incompatibility issues when trying to move our model to production.

## Part II: API development
