Hi everyone! How are you doing?
Today the sun is shining, and after some gym time and work, I finally found a moment to write a long article I’ve been wanting to share on my blog.
You know, Andrew Ng has often stressed the importance of deeply understanding the fundamentals of machine learning to become a better practitioner. In his popular courses—such as the renowned Machine Learning course on Coursera—Ng emphasizes that mastering core concepts like linear and logistic regression, neural networks, support vector machines (SVM), and principal component analysis (PCA) is essential for building a strong foundation in the field.
The message is simple. These days, we often hear about machine learning and AI in the context of complex and advanced models. But as with many things in life, it’s crucial to start from the basics. That’s why, before diving into the world of large language models (LLMs), one should first become confident and comfortable with simpler models—models that, in many practical cases, are actually more suitable for solving real-world problems. At least, that’s my perspective as a bioinformatician.
To draw an analogy: you can’t become a great chef if you haven’t first learned how to properly handle a knife.
Today, I’d like to walk you through the fundamentals of linear regression, and I’ll try to do so as clearly and thoroughly as possible. But before we begin, let me set two premises:
-
This article focuses on multiple linear regression, where more than one feature is involved. That’s because this type of model is far more commonly used in real-world applications.
-
To help you fully understand how multiple linear regression works, I’ll briefly touch on some essential concepts from linear algebra. Trust me, it’s worth it—it will make everything much easier to follow.
Let’s begin with the linear algebra concepts that are most important for understanding multiple linear regression.
Foundations of Linear Algebra
As a bioinformatician, I often work with data. But the question I get asked most frequently by friends or family members who are not in this field is:
"What exactly is data?"
Let me put it this way: a computer is a calculator, and machine learning models are mathematical functions that take numbers as input and return other numbers as output.
In this context, data are simply the numbers used by models—numbers on which the computer performs computations. However, in order to be used effectively, these numbers must be organized in a specific structure.
There are several ways to organize and work with data, and one of the key branches of mathematics that allows us to do so is called Linear Algebra.
In detail the data can be presented as:


Of course, we need to communicate this structure to the computer using a programming language.
In Python, for example, we can represent and manipulate data according to the principles of linear algebra using the NumPy library.

import numpy as np
scalar = 23
vector = np.array([1,2])
matrix = np.array([[1,1],[2,2]])
tensor = np.array([[[1,1],[2,2]],
[[3,3],[4,4]]])
print('vector shape:', vector.shape)
print('matrix shape:', matrix.shape)
print('tensor shape:', tensor.shape)
#### OUTPUT
vector shape: (2,)
matrix shape: (2, 2)
tensor shape: (2, 2, 2)
As mentioned, with these numbers (i.e., the data), we can perform a variety of mathematical operations. Below is a list of the most important ones:
1) Addition, Subtraction, Multiplication, and Division
These basic operations can be performed on scalars, vectors, matrices, and tensors.
However, for vectors, matrices, and tensors, it's essential to pay attention to their shape.
The shape describes how data is organized within the structure.
-
The shape of a vector is a single number.
Example:v = [1, 5, 7]→ shape:3 -
The shape of a matrix is defined by two numbers: number of rows × number of columns.
Example:m = [[1,1],[2,2]]→ shape:2 x 2 -
The shape of a tensor is defined by three numbers: rows, columns, and depth (or blocks).
Example:t = [[[1,1],[2,2]], [[3,3],[4,4]]]→ shape:2 x 2 x 2

You cannot perform element-wise operations between vectors, matrices, or tensors that have different shapes.
Don’t believe it? Let’s try to multiply two vectors with different shapes in Python and see what happens!
v1 = np.array([1, 2])
v2 = np.array([2, 5, 6])
v1 * v2
#### OUTPUT
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[120], line 4
1 v1 = np.array([1, 2])
2 v2 = np.array([2, 5, 6])
----> 4 v1 * v2
ValueError: operands could not be broadcast together with shapes (2,) (3,)
⚠️ WARNING: Don't confuse these two concepts!
Here’s a quick bonus tip: it's common to confuse the dimension of data with its shape (I’ve been there too!). But they’re not the same.
- Dimension = the number of axes (or levels, or independent directions).
- Shape = the number of elements along each axis.
Example: an array of shape 2 x 3 has:
- dimension = 2 (because it has two axes: rows and columns)
- shape =
(2, 3)

2) Dot Product
The dot product (also known as matrix multiplication) is an operation between two arrays that don’t need to have the same shape, but they must satisfy one specific rule:
The number of columns of the first matrix must match the number of rows of the second.
This is because the dot product involves multiplying rows from the first matrix with columns from the second, element by element, and then summing the results.
The output is a new matrix.
Example:
If A has shape (2 x 3) and B has shape (3 x 1), then A · B is valid and results in a matrix with shape (2 x 1).

In python:
matrix1 = np.array([[1,1],[2,2], [3, 3]])
matrix2 = np.array([[1,1],[2,2]])
print('Result of dot product\n', matrix1.dot(matrix2))
#### OUTPUT
Result of dot product
[[3 3]
[6 6]
[9 9]]
⚠️ IMPORTANT: Matrix multiplication is not commutative!
In general, the dot product of matrix1 · matrix2 is not equal to matrix2 · matrix1.
This is a fundamental property of matrix algebra:
- You can only perform the dot product if the number of columns in the first matrix matches the number of rows in the second.
- Even when both multiplications are valid (e.g., square matrices), the resulting matrices are usually different.
🔍 Example:
Let:
Abe of shape(2 x 3)Bbe of shape(3 x 2)
Then:
A · Bresults in a(2 x 2)matrixB · Aresults in a(3 x 3)matrix
✅ Both are valid but yield different outputs!
matrix1 = np.array([[1,1],[2,2], [3,3]])
matrix2 = np.array([[1,1,1],[2,2,2]])
print('Result of dot product of matrix1\n', matrix1.dot(matrix2))
print('Result of dot product of matrix2\n', matrix2.dot(matrix1))
#### OUTPUT
Result of dot product of matrix1
[[3 3 3]
[6 6 6]
[9 9 9]]
Result of dot product of matrix2
[[ 6 6]
[12 12]]
3) Reshape
It is possible to convert a matrix into a vector (and vice versa) through an operation called reshape.
It may sound trivial, but reshaping data is a very common operation in machine learning.
For instance, in neural networks, it's often necessary to reshape or flatten a matrix to reduce its shape into a single row or column.
This ensures compatibility with the input layer of the model.


# Reshape vector
vector = np.array([1,2,2,1])
new = vector.reshape(2,2)
print('original vector shape\n', vector.shape, '\n', 'reshape vector\n', new, '\n', 'New vector shape\n', new.shape)
# Reshape matrix
matrix = np.array([[1,2],[2,1]])
new = matrix.reshape(4,1)
print('original matrix shape\n', matrix.shape, '\n', 'reshape matrix\n', new, '\n', 'New matrix shape\n', new.shape)
# When the size of the matrix is unknown, reshape(-1) is also commonly used to reduce the matrix dimension and “flatten” the array into one row.
print('reshaped matrix\n', matrix.reshape(-1))
#### OUTPUT
original vector shape
(4,)
reshape vector
[[1 2]
[2 1]]
New vector shape
(2, 2)
original matrix shape
(2, 2)
reshape matrix
[[1]
[2]
[2]
[1]]
New matrix shape
(4, 1)
reshaped matrix
[1 2 2 1]
4) Transpose
The transpose operation is used to swap the rows and columns of a matrix.
It is a fundamental and widely used operation in machine learning, for example:
- during dot product computations,
- in the analytical solution of linear regression,
- or in the optimization of weights in neural networks.

4) Inverse
The inverse of a matrix is an operation that allows you to obtain another matrix—called the inverse matrix—whose product with the original matrix returns the identity matrix.
The identity matrix has the same shape as the original one but contains 1s along the diagonal and 0s elsewhere.
Take a look at the image below to better understand what I mean.

But if you are wondering how to calculate the inverse of a matrix, here is below in action the formula for calculating the inverse of a matrix:

The formula above helps us understand that not all matrices have an inverse.
In fact, the necessary conditions for a matrix to be invertible are the following:
- The matrix must be square, meaning it must have the same number of rows and columns.
- The matrix must not be singular, that is, its determinant must be different from zero.
Good. Now that we have laid the foundations of linear algebra, we can get into the heart of the article by talking about multiple linear regression.
LINEAR REGRESSION
Linear regression is a model that allows us to predict the values of a target variable based on its relationship with certain predictive variables, also known as features.
This model is nothing more than a function that builds a line to fit the data points from a training dataset, capturing the linear relationship between one or more features and the target variable.
When there is only one feature, we refer to it as simple linear regression, while when there are multiple features involved, we talk about multiple linear regression.




⚠️ Pay attention to the term linear.
It is called linear regression because the relationship between the features and the target variable can be described by a straight line. However, this is not always the case: in many situations, such a relationship cannot be adequately modeled by a straight line, and in those cases it becomes necessary to introduce terms of non-linearity.

1) Linearity of the Relationship Between Features and the Target Variable
As mentioned earlier, the relationship between the features and the target variable must follow a straight line — meaning that the model assumes a linear dependence. In other words, a one-unit change in a feature should lead to a proportional (positive or negative) change in the value of y.
Furthermore, the mean residual error, which describes the average variability not explained by the model, should tend toward zero along the regression line.
Remember: the residual error for each data point i is calculated as the difference between the predicted value from the model and the actual observed value of y at that point.

2) Independence of Residual Errors
The residuals calculated for each observation must be independent of one another, meaning they should not be correlated. This assumption is especially important in time series or spatial data, where an error at one point could influence subsequent ones.
3) Homoscedasticity of Residuals
The residuals should have constant variance across all levels of the independent variables. In other words, the spread of the residuals should remain consistent regardless of the predicted values.
When this assumption is violated, the phenomenon is called heteroscedasticity, and it can undermine the reliability of confidence intervals and hypothesis tests.


4) Normality of Residual Errors
The residual errors should follow an approximately normal (Gaussian) distribution.
A common mistake is to assume that the target variable y must also be normally distributed — this is not true. The normality assumption applies only to the distribution of the residuals, not to y itself.

5) Absence of Multicollinearity Among Features
In the case of multiple linear regression, it is essential that there is no multicollinearity — in other words, the features must be independent and not strongly correlated with each other.
Otherwise, it becomes difficult to accurately estimate the individual effect of each feature on the variability of the target variable y.

6) Influential Points
It is always important to verify whether some points in the training dataset exert too much influence on the regression line, potentially compromising its ability to fit the data correctly and to capture the true relationship between the independent variables (features) and the dependent variable (target).
To identify these influential points, we can use diagnostic statistics such as:
- Leverage
- Cook’s Distance
- DFBETAS
Meeting all the assumptions listed earlier does not, on its own, guarantee the success of a linear regression model. It is still necessary to train the model on a training dataset and evaluate it on a test dataset, using appropriate performance metrics.
As shown in the multiple linear regression formula, there are several coefficients that describe the regression line and thus the relationship between the features and the target variable. In particular, the intercept β₀ represents the value of y when all features are equal to zero. The slope coefficients (β₁, β₂, ..., βn) indicate how much y changes with a one-unit change in each corresponding feature. For example, a β₁ = 2 means that an increase of one unit in x₁, holding other variables constant, results in an increase of 2 units in y.
Once it’s clear that each coefficient reflects the direct effect of a feature on the target variable, it becomes obvious that estimating the right coefficients is crucial for building a model that fits the training data and captures the underlying patterns.
The goal is to create a model that can learn the general relationship within the training data (this is where the bias-variance trade-off comes into play), and then test it on unseen data (the test set), for which the true values of y are known. By comparing model performance on training and test sets, we can optimize the coefficients to improve the model. This process is at the heart of supervised machine learning.
A good analogy is that of a student preparing for an exam: they study and practice on known questions (training), then challenge themselves with new ones (testing). They analyze the mistakes, refine their understanding, and once their performance is strong enough, they take the final exam — where the questions are new and the answers are unknown.
That’s why a supervised model like linear regression must follow a rigorous workflow in order to be successfully used for future predictions.

6) Evaluating the Model: Metrics and Diagnostics
This leads us to make a bold but accurate statement: the true parameters we aim to estimate in linear regression are not the values of y themselves, but rather the slope coefficients ( \beta_n ), as they are directly responsible for predicting y.
I can hear your question already:
"How can we be sure that our model will produce accurate predictions?"
Let’s answer this fundamental question. There are several methods and metrics available to assess the quality of a regression model during both training and testing phases. Here are two essential ones:
COEFFICIENT OF DETERMINATION (( R^2 ))
- Measures how much of the variability in y is explained by the model.
- Denoted as ( R^2 ), it is calculated as:

- ( R^2 ) ranges from 0 to 1. A higher value indicates that the model explains a larger portion of the variance in y, which is desirable.
- However, ( R^2 ) can increase simply by adding more predictors, even if they are not meaningful.
⚠ CAUTION:
When a model includes many predictors (high p), it may become too complex and lose generalizability. In such cases, it is better to use the Adjusted ( R^2 ), which accounts for the number of predictors and provides a more reliable measure of model quality.

RESIDUAL ANALYSIS
Another important diagnostic tool is the analysis of residuals, which are the differences between observed and predicted values.
- A typical residual plot has predicted values on the x-axis and residuals on the y-axis.
- In a good linear model, the residuals should be randomly scattered around zero, showing no specific pattern.
- If we observe a pattern (e.g., a "U" shape), this suggests that the relationship between variables is likely non-linear and that the linear regression model is inadequate.
- In such situations, we may consider switching to a non-linear model, such as polynomial regression, which can better capture complex relationships.

MSE and RMSE
These two metrics are essential for evaluating the predictive performance of a regression model. More generally, both MSE and RMSE can also be used as loss functions, meaning they serve as objective functions to minimize during model training.
Mean Squared Error (MSE)


In simple terms, MSE is the mean of the squared residuals, i.e., the average of the squared differences between predicted and actual values.
A lower MSE means that the model’s predictions are closer to the true values of y.
However, MSE has an important limitation: because it squares the errors, its output is expressed in squared units of the target variable, making its magnitude harder to interpret. To solve this issue, we often use its square root — the RMSE.
Root Mean Squared Error (RMSE)

RMSE improves interpretability by restoring the error unit back to that of the target variable.
For example, if y represents height in meters (m), then:
- MSE is expressed in square meters (m²)
- RMSE is expressed in meters (m)
💡 Bonus
Another option is the MAE (Mean Absolute Error), which measures the average of absolute differences between predictions and actual values.
Since it does not square the errors, it avoids the issue of scale interpretation and is more robust to outliers. However, unlike MSE, MAE is not differentiable at zero, which makes it less ideal for gradient-based optimization.

F-Statistic and P-Value
This metric is specifically relevant in the context of multiple linear regression. When we include several independent variables in a regression model, an important question to ask is:
"Does the model truly explain the relationship between the predictors (X) and the target (y), or are we just adding variables without improving prediction?"
To answer this question, we use a hypothesis test based on the F-statistic and its associated p-value. As with any hypothesis test, we begin by defining our null and alternative hypotheses. See the figure below for a visual representation:

Once the hypothesis system is defined, we compute the F-statistic and its corresponding p-value to decide whether or not to reject the null hypothesis.
The F-statistic compares the variability explained by the model (regression sum of squares) to the variability left unexplained (residual sum of squares), adjusted by their respective degrees of freedom. Formally:
F = (MSR / df_model) / (MSE / df_residual)
Where MSR is the Mean Square Regression and MSE is the Mean Square Error.

A large F value suggests that at least one of the β coefficients associated with the independent variables is significantly different from zero, meaning that the model likely captures a real relationship between the predictors and the target.
However, this is not enough—we need to examine the p-value associated with the computed F-statistic.
If the null hypothesis ( H_0 ) were true, we would expect the computed F-statistic to follow a known reference distribution (the F-distribution). The p-value tells us the probability of observing an F-statistic at least as extreme as the one computed, under the null hypothesis.


In practice, if the p-value is less than a pre-defined alpha threshold (e.g. 0.05), we reject the null hypothesis. This means we conclude that the model provides a statistically significant explanation of the variability in y using the predictors.
📌 Important Note:
The p-value is derived based on the computed F-statistic and the degrees of freedom:
- Numerator (df1) = number of predictors = p
- Denominator (df2) = n – p – 1
These values can be obtained using statistical software (like R or Python), or through F-distribution lookup tables for educational purposes:


OPTIMIZING THE LINEAR REGRESSION MODEL
There’s one last very important topic we need to cover to truly master the linear regression model: understanding how it is optimized during the training and testing phases, in order to obtain the best model possible — one that can generalize well even to unseen data and make accurate predictions.
Before diving in, I think it’s helpful to quickly recap where we are, so we don’t lose the logical thread:
- We’ve reviewed some linear algebra concepts necessary to understand how multiple linear regression works mathematically.
- We’ve explored the assumptions that must be respected in order to apply the linear regression model successfully.
- We’ve discussed the metrics used to evaluate model performance.
In general, when applying a model for a specific task — whether it’s prediction or classification — it’s not enough to run it once on a dataset. It’s essential to optimize the model’s parameters by training it multiple times on a training dataset and evaluating it on a testing dataset, before deploying it on unseen data.
It may sound like a trivial analogy, but I think it fits well: imagine an athlete learning a new skill — for example, a soccer player trying to perfect a bicycle kick. The athlete will train repeatedly, sometimes testing the movement in front of family members, and only when confident enough will they attempt the move in a real match. Similarly, in multiple linear regression, during training the model adjusts and optimizes the slope coefficients based on the loss function. Optimization during training is important — but so is evaluation during testing.
At the end of training and testing, we hope to have a regression model with the optimal coefficients for predicting values of ( y ) as accurately as possible.
So it becomes clear:
The primary goal of training and testing the model is to find the slope coefficients that minimize the prediction error.
This can be expressed mathematically as:
That is, the optimal slope coefficients are those that result in the lowest value of the loss function, which in the case of linear regression is usually the MSE, RMSE, or even MAE.
So the real question becomes:
How do we find the minimum value of the loss function?
Let’s begin by noting that the MSE loss function is a convex parabolic function, meaning it has a single global minimum. This global minimum is what we aim to reach during training, because it corresponds to the best estimated coefficients — and thus the best model for predicting ( y ).
To find the global minimum, we progressively update the slope and intercept values at each epoch of training. Initially, these values are often set to zero or chosen randomly. Then, we assess how the loss function changes with each update.
Imagine a blindfolded man trying to walk down a hill. The only way he can tell if he's moving downhill or uphill is by taking small steps and feeling the terrain under his feet. Is the slope increasing or decreasing?
To assess the slope of the loss function at any given point, we use the derivative of the loss function — a mathematical function that tells us the steepness of the curve at a given point. Graphically, a derivative is a line tangent to the function, and the angle it forms with the x-axis gives the derivative’s value, which indicates the slope.
As we move along the curve of the loss function during training, we compute many such derivatives. The point at which the derivative equals zero is the minimum of the loss function — and this gives us the best possible model.
⚠️ However, keep in mind: in multiple linear regression, we have multiple features — and therefore, multiple coefficients to optimize. This means we compute a partial derivative of the loss function for each coefficient. The set of these partial derivatives forms a vector known as the gradient, which determines the overall direction in which the loss function is changing.
Thus, in multiple linear regression, the global minimum is found when each of these partial derivatives is equal to zero — meaning the global gradient has a slope of zero.
Here’s a mathematical representation to clarify this concept:

Gradient Descent Algorithm
At this point, we’ve clarified that the best predictive model is the one that corresponds to the global minimum of the loss function. This minimum is found by progressively adjusting the values of the β coefficients associated with each independent variable.
To perform these progressive updates, we need a computational algorithm — in the case of linear regression, this algorithm is called gradient descent.
Gradient descent relies on the chain rule to compute partial derivatives for each coefficient. (I won’t go into the details of the chain rule here to keep the flow of the article intact — but if you're interested, let me know in the comments!)
To work properly, gradient descent requires 4 main ingredients:
1) A loss function (e.g., MSE)
This is the function that measures the error of the model's predictions. In linear regression, we typically use the Mean Squared Error (MSE).
2) Initial values for the β coefficients
These are the starting values for the model's parameters.
Before you can “descend” the loss function, you need to start somewhere — typically with all values set to 0, randomly initialized, or selected based on a heuristic.
3) A learning rate
This is a crucial hyperparameter — a setting chosen by the user — that determines the size of the steps taken during gradient descent.
A learning rate that's too small may lead to very slow convergence, while one that's too large may cause the algorithm to overshoot the minimum or diverge altogether.
The learning rate controls how much the β values are updated at each step of training.

4) A tolerance level (stopping criterion)
You can’t run gradient descent forever — at some point, you have to decide when the updates are small enough that you're “close enough” to the minimum.
This threshold is called the tolerance level, and it's usually a small positive value — for example, 0.00001.
In simple terms, when the updates to the β values become smaller than the tolerance, the algorithm stops.
Visual Example
Take a look at the GIF below to understand how gradient descent works in practice. In the example, we see the optimization of the parameters a₀ and a₁ in a simple linear regression model:
yᵢ = a₀ + a₁xᵢ + eᵢ
Notice how the regression line adjusts more and more to the training data as the gradient descent progresses.

Let’s try to rewrite the example seen from the gif above mathematically. But please consider a0 = beta0 and a1 = beta1, so it’s all more elegant.






Just a little practice!!!
Well. To close in beauty I think there is no better way to wrap up all the concepts expressed in this article by getting our hands dirty with a multiple linear regression model on a simulated agronomic type dataset. Below you find the possibility to download the jupiter notebook and the dataset used. Have fun!
📦 Download the notebook + dataset package (.zip)
This concludes the article on linear regression. I hope it has been useful and if you find that there are imperfections or issues to be clarified or on which we can discuss in a constructive way, please comment and I will be happy to answer.
Bye and see you soon!
References
- https://www.geeksforgeeks.org/ml-linear-regression/
- https://www.youtube.com/watch?v=nk2CQITm_eo
- https://machinelearningmastery.com/solve-linear-regression-using-linear-algebra/
-https://journeytodatascientist.blog/2024/02/18/what-is-the-relationship-between-linear-regression-and-linear-algebra/ - https://jaketae.github.io/study/linear-regression/