Skip to content

A guide for creating a program that predicts the price of a car by using a linear function train with a gradient descent algorithm.

Notifications You must be signed in to change notification settings

leogaudin/ft_linear_regression

Repository files navigation

📈 ft_linear_regression

ft_linear_regression is the first project of the Machine Learning branch at 42.

The goal of this project is to introduce us to the basic concepts of ML.

We will have to implement a program that predicts the price of a car by using a linear function train with a gradient descent algorithm.

The implementation of this project will be done in Python.

I will only explain the concepts required for the second part of the project.

The first part is literally read a file, parse floats and use them to compute user input.

Table of Contents

Concepts needed for the project

Linear Regression

Simple linear regression aims to represent the relationship between a feature and a target variable.

The formula for a linear regression is:

$$ y = θ_{0} + θ_{1}x $$

Where:

  • $y$ is the target variable
  • $x$ is the feature
  • $θ_{0}$ is the bias
  • $θ_{1}$ is the coefficient

It's literally a line equation, like $y = mx + b$ in school.

In our case, the target variable is the price of a car and the feature is the mileage.

For example, if we find out that $price = -0.5 * mileage$, it means that for each unit of mileage, the price of the car decreases by 0.5.

Loss Function

In ML, a loss function is used to quantify how well a model is performing, its precision.

It measures the difference between the predicted values and the actual values.

The loss function used in our linear regression is the Mean Absolute Error (MAE).

$$ MAE = \frac{1}{m} \sum_{i=1}^{m} | \hat{y_{i}} - y_{i} | $$

Where:

  • $m$ is the number of samples
  • $\hat{y}$ is the predicted value
  • $y$ is the actual value

To make it simpler, it means we are going to estimate the price for a mileage, subtract it to the actual price in the data we have, do this for all the data points, and then divide the sum by the number of data points.

For example, if we have 3 data points:

  • $y_{1} = 1000$ and $\hat{y}_{1} = 900$
  • $y_{2} = 2000$ and $\hat{y}_{2} = 1900$
  • $y_{3} = 3000$ and $\hat{y}_{3} = 3100$

The MAE will be:

$$ MAE = \frac{|1000 - 900| + |2000 - 1900| + |3000 - 3100|}{3} = 100 $$

Here, the MAE is 100, which means that our model is off by 100 on average.

Unless your data points are perfectly aligned, the MAE will never be 0, it is totally normal to have a loss.

Now that we know how to measure the precision of our model, we need to find the best $θ_{0}$ and $θ_{1}$ that minimize this loss function.

Gradient Descent

Gradient descent is an optimization algorithm that consists of finding the minimum of a function by iteratively getting closer to it.

In our case, the function we want to minimize is the loss function.

The algorithm works as follows:

  1. Initialize the $θ_{0}$ and $θ_{1}$ to 0
  2. Calculate the gradient of the loss function, that is to say "what do $θ_{0}$ and $θ_{1}$ miss to be optimal?"
  3. Update the $θ_{0}$ and $θ_{1}$ in the opposite direction of the gradient (subtract it)
  4. Repeat steps 2 and 3 until the loss function converges (i.e. stagnates)

The formula to update the $θ_{0}$ and $θ_{1}$ is:

$$ θ_{0} = θ_{0} - α \frac{1}{m} \sum_{i=1}^{m}(\hat{y}_{i} - y_{i}) $$

$$ θ_{1} = θ_{1} - α \frac{1}{m} \sum_{i=1}^{m}((\hat{y}_{i} - y_{i}) * x_{i}) $$

Where:

  • $α$ is the learning rate
  • $m$ is the number of samples
  • $\hat{y}$ is the predicted value
  • $y$ is the actual value
  • $x$ is the feature

The learning rate is a hyperparameter that controls how much we update the $θ_{0}$ and $θ_{1}$ at each iteration.

If it's too high, we might overshoot the minimum, if it's too low, we might take too long to converge.

Why do we use these formulas to update the $θ_{0}$ and $θ_{1}$?

  • $θ_{0}$'s update is straightforward, it's just the average of the errors: if your line is ≈ 100 above the actual values, you just need to lower it by 100.

  • $θ_{1}$'s update is a bit more complex, given that it's a coefficient. We need to correct its offset, but also its slope (how inclined it is).

You might have noticed that the two update formulas I provided are literally the ones in the subject:

$tmpθ_{0} = θ_{0} - learningRate * \frac{1}{m} \sum_{i=0}^{m - 1} (estimatePrice(mileage[i]) - price[i])$

and

$tmpθ_{1} = θ_{1} - learningRate * \frac{1}{m} \sum_{i=0}^{m - 1} (estimatePrice(mileage[i]) - price[i]) * mileage[i]$

Feature Scaling

If we implement this code as is, we might run into a problem: the mileage is in the thousands, while the price is in the tens of thousands.

This means that the $θ_{1}$ will be updated more for the mileage than for the price, which is not what we want.

Anyways, the code would crash because $θ_{1}$ would be way too high.

To avoid this, we need to scale the features.

In our case, we will use the standardization method (because the min-max scaling was a pain to reverse later).

$$ x_{scaled} = \frac{x - μ}{σ} $$

Where:

  • $x$ is the feature
  • $μ$ is the mean of the feature
  • $σ$ is the standard deviation of the feature

The mean is used to center the data around 0

The standard deviation is used to scale the data, so it has a variance of 1 (i.e., the data points are equally spread on the x-axis)

For example, if we have the following mileages:

  • $mileage = [1000, 2000, 3000]$
  • $μ = 2000$
  • $σ = 816.5$

The scaled mileage will be:

$$ mileage_{scaled} = \left[ \frac{1000 - 2000}{816.5}, \frac{2000 - 2000}{816.5}, \frac{3000 - 2000}{816.5} \right] = [-1.22, 0, 1.22] $$

Wrapping it up

Now that we know all the concepts needed for the project, we can summarize the steps to implement the linear regression:

  1. Load the data
  2. Scale the feature (the mileage)
  3. Initialize $θ_{0}$ and $θ_{1}$ to 0
  4. Make a naive prediction of $price = θ_{0} + θ_{1} * mileage$ for every data point
  5. Calculate the average error
  6. Update $θ_{0}$ and $θ_{1}$ accordingly
  7. Repeat steps 4 to 6 until the loss converges
  8. Save $θ_{0}$ and $θ_{1}$ to a file

Resources

About

A guide for creating a program that predicts the price of a car by using a linear function train with a gradient descent algorithm.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published