Hello everyone. Christmas is already here. I currently have three monitors in front of me. On one monitor, there's a single cell multi-omics analysis running in RStudio, on another an AI model for image classification tasks is fine-tuning in a Jupyter Notebook, and finally, I have a screen where I'm writing this new blog post. Often, my desire to learn overwhelms me. I always have to balance the time I spend learning new things with my work as a bioinformatician that I need to carry forward. For now, I keep repeating to myself like a mantra: "Don't rush to climb the mountain." Anyway, today I want to finish the discussion on Statistical Inference and thus complete my review of basic statistics.

Ready? Ok, let's start.
In the last article, we talked about how scientists compare different groups based on a certain variable under study or compare different variables with each other at the level of the same group of statistical units. But there's more. Often, scientists want to model a phenomenon to understand it better.
Yes, I know. You're wondering what exactly a "model" is. Like you, I hear this word numerous times, in statistics, bioinformatics, machine learning, and AI, but I struggle to grasp its real meaning.
The definition of a model that fits best for me is:
"A model is a more or less simplified representation of reality."
Imagine wanting to understand what a Ferrari is like. If, like me, you can't afford to buy one, you might purchase a scale model of your favorite Ferrari. Of course, you can't get on it and hope to speed around like mad, but at least you're able to understand how it's made, both outside and maybe inside, since the model is a simplified representation of the actual car.

In the same way, the normal distribution is a model of how the probability values of certain variables, especially biological ones, tend to distribute themselves on a plot with variable values on the x-axis and probability values on the y-axis (see image below). For example, if we imagine evaluating the probability distribution of the variable "weight of an egg," we'd notice that these tend to take a shape similar to a normal distribution, so we can use the latter as a model, or simplification, of the real distribution. This is advantageous because many calculations and statistical considerations are easier to apply to the normal distribution rather than to the real distribution.

When we use a model, we must always ask ourselves: "What do I need it for?"
A model can have three main purposes:
- Describing a complex phenomenon: When studying an extremely complex phenomenon where many variables interact with each other, we can use models that simplify the phenomenon in order to describe it best. For example, a linear regression model can be used to describe how the variable y varies with the variable x in a complex process that involves several confounding factors.
- Interpreting a phenomenon: A model can help us interpret a certain phenomenon. For example, a linear regression model allows us to interpret the relationship between the variable y and the variable x under examination.
- Predicting the outcome of a phenomenon: In my opinion, predicting an outcome is one of the most fascinating purposes of a model. Taking the linear regression model as an example again, it's notable that it is also used to predict the value of a variable y given a specific value of the variable x.
I've mentioned the linear regression model several times because, as discussed in the previous article, scientists compare variables using specific statistical tests to answer questions like: "Is variable x correlated with variable y?" But how exactly does y vary with x? How can I interpret the strength of the relationship between y and x? And how can I use that relationship to predict the future value of y given a certain value of x?
All of these questions have a single answer: the linear regression model.
The Linear Regression Model
The linear regression model enables us to derive various insights from the relationship between two variables measured across a series of statistical units. Typically, these two variables are defined as:
- Variable X, also known as the explanatory variable, regressor, or independent variable. It's the variable that explains the variation in the values of variable Y.
- Variable Y, also known as the response variable or dependent variable. It's the variable whose values vary as an effect of the variation in the values of X.
Practically speaking, the model is a mathematical function. Imagine it as a machine that takes objects as input and returns a final object as output.

Graphically, this function constructs a line that approximates the cloud of (x, y) points describing the relationship between X and Y in the scatter plot.

Analyzing the formula of the linear regression model, three fundamental values immediately stand out:
- β0, known as the intercept, is the value of variable y when variable x is 0.
- β1, known as the slope coefficient or regression coefficient of the line, determines the slope of the line. It quantifies the effect of x on y: the average change in y for a one-unit change in x.
- If β1 > 0, it indicates a direct or positive relationship between variable x and variable y. Practically, this means that y increases on average when the value of x increases by one unit.
- If β1 < 0, it suggests that if variable x increases by one unit, variable y tends to decrease on average.
- ε, known as the error term or residual of the model. When evaluating the relationship between x and y, it's essential to consider that there might be various other known or unknown factors influencing the value of y. The pure mathematical relationship of the straight-line formula does not account for these factors that might confound the relationship between x and y; therefore, a certain error, expressed through ε, is necessary.

For example, imagine studying the relationship between human height (x) and human weight (y). If x = 180 cm, then according to the straight-line formula y = β0 + β1*x, the y value corresponding to x should be 78 kg. But as you can see from the scatter plot below, there are several (x, y) combinations for x = 180 cm. This is because in nature we rarely have such a direct, let me say 'mathematically trivial,' situation; there are various other variables that contribute to the variation in y, such as the subject's lifestyle, diet, whether the scale is faulty, and so on.

For this reason, to truly capture the relationship between height and weight, it's necessary to create a more 'flexible' model that accounts for error through the value of ε, which for each specific statistical unit defined by i is calculated as follows:

Hence, it's apparent that the values of β0 and β1 define the true nature of the model and its ability to approximate the relationship between x and y. Their selection is therefore crucial. It's important to note that these are estimated values, but how exactly can we determine the correct values for β0 and β1?
To estimate the values of β0 and β1, the Method of Least Squares is used: it chooses the values of β0 and β1 that minimize the sum of the squared distances between the points (observed xy values) and the line (estimated xy values) on the graph. Moreover, since the distance between a specific point and the line is equal to the value of εi (the residual for a specific statistical unit i), the sum of the squared distances between points and line is equivalent to the sum of the squared residuals (Σ εi^2).
Let's look at the formulas for estimating the values of β0 and β1:

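If you'd like to see these formulas at work, here is a minimal sketch in Python. The height/weight data are simulated by me purely for illustration; the point is only that the 'by hand' estimates match what a standard library returns.

```python
# A minimal sketch of least-squares estimation "by hand" on simulated data.
# The variable names and numbers below are my own illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
height_cm = rng.normal(175, 8, size=100)                    # regressor x
weight_kg = -80 + 0.9 * height_cm + rng.normal(0, 6, 100)   # response y with noise

x_mean, y_mean = height_cm.mean(), weight_kg.mean()

# beta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2); beta0 = y_mean - beta1 * x_mean
beta1 = np.sum((height_cm - x_mean) * (weight_kg - y_mean)) / np.sum((height_cm - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Cross-check against scipy's implementation of simple linear regression
res = stats.linregress(height_cm, weight_kg)
print(beta0, beta1)                  # manual estimates
print(res.intercept, res.slope)      # should match up to floating-point error
```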
However, estimating the values of β0 and β1 is not sufficient. To be sure of having obtained statistically significant estimates, especially for β1, it's necessary to perform a hypothesis test, as shown in the image below:

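For the curious, this is roughly what that test looks like in practice: a hedged sketch with statsmodels on simulated data, where the summary reports, for each coefficient, the t statistic and the p-value of the test H0: β = 0.

```python
# A sketch of the coefficient hypothesis tests via statsmodels OLS (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(175, 8, size=100)
y = -80 + 0.9 * x + rng.normal(0, 6, size=100)

X = sm.add_constant(x)          # adds the column of 1s for the intercept beta0
model = sm.OLS(y, X).fit()

print(model.params)             # estimated beta0 and beta1
print(model.pvalues)            # p-values of the t-tests on beta0 and beta1
print(model.summary())          # full table: estimates, std errors, t, p, confidence intervals
```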
By estimating the values of β0 and β1, we thus create a certain linear regression model, but how do we know if the model used is really 'good,' that is, capable of describing, interpreting, and predicting the relationship between a variable x and a variable y?
To define the goodness of a linear regression model, the Coefficient of Determination (R^2) can be used. It measures how well the line fits the points given by the xy pairs. In practice, it tells us how much of the variability of y is truly attributable to variable x. Naturally, this also reflects the reliability of the predictions of y made through the model.

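As a tiny worked example (simulated data, purely illustrative), the coefficient of determination can be computed straight from its definition, R^2 = 1 - SS_res / SS_tot:

```python
# A small sketch: R^2 from its definition, cross-checked against scipy's r value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 200)

fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x

ss_res = np.sum((y - y_hat) ** 2)        # variability left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)     # total variability of y
r_squared = 1 - ss_res / ss_tot

print(r_squared, fit.rvalue ** 2)        # the two values should coincide
```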
It's important to clarify that for effectively applying a linear regression model, certain key assumptions must be met. These assumptions are crucial to ensure the accuracy of parameter estimates and the validity of the conclusions drawn from the model. The main assumptions of linear regression include:
A) Linearity: The relationship between the independent variables and the dependent variable must be linear. This means that a one-unit change in an independent variable should be associated with a constant change in the dependent variable, regardless of the level of that variable.
To verify the linearity prerequisite in a linear regression model, various methods and tests can be employed. These methods aim to assess whether the relationship between the independent variables and the dependent variable is indeed linear. The most common include:
1) Graphical Analysis of Residuals:
- Plot of Residuals Against Predicted Values: A plot of residuals (the difference between observed and predicted values) against the predicted values from the regression can reveal non-linear patterns. In a properly specified model with a linear relationship, the points should be randomly distributed around zero without forming specific patterns.
- Scatter Plot Between Variables: A scatter plot between independent and dependent variables can help identify if the relationship is linear. If the relationship appears non-linear, it may be necessary to transform the variables or use a non-linear model.
2) Formal Linearity Tests:
- Lack of Fit Test: This test compares the linear regression model with a more flexible model that may fit the data better. If the more flexible model fits significantly better, this suggests that the linear model may not be adequate.
- Harvey-Collier Test: This test is specific for linear regression models and checks the linearity of the relationship between variables.
3) Variable Transformations:
- Sometimes, the relationship between variables can be made linear through variable transformations, for example by applying logarithms, square roots, or other mathematical transformations. After transformation, scatter plots and residual plots can be re-examined to assess linearity.
It is important to note that no test can definitively confirm linearity; rather, these methods provide clues as to how well the linearity assumption is met. If tests suggest that the relationship is not linear, alternative models or transformations of the variables may need to be considered to better capture the relationship between the variables.
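Here is a rough sketch of two of the checks just described, on simulated data (names and numbers are my own assumptions): a residuals-vs-fitted plot and statsmodels' Harvey-Collier test.

```python
# A sketch of linearity diagnostics: residuals vs fitted values + Harvey-Collier test.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_harvey_collier

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 150)
y = 3 + 2 * x + rng.normal(0, 1, 150)      # a genuinely linear relationship

res = sm.OLS(y, sm.add_constant(x)).fit()

# 1) Residuals vs fitted values: points should scatter randomly around zero
plt.scatter(res.fittedvalues, res.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# 2) Harvey-Collier test: a small p-value suggests the linear form is inadequate
tvalue, pvalue = linear_harvey_collier(res)
print(tvalue, pvalue)
```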
B) Independence of Errors: The errors (residuals) of the model, i.e., the differences between observed values and those predicted by the regression, must be independent. Correlation among errors can indicate issues like autocorrelation.
To verify the assumption of error independence in a linear regression model, i.e., to ensure that the residuals (errors) of the model are not correlated with each other, the following methods are primarily used:
- Graphical Analysis of Residuals:
- Residual Plot Against Time or Data Order: If the data have a temporal or sequential order, a plot of residuals as a function of time or the sequence of observation can help detect autocorrelation. In the presence of independence, residuals should appear random and not show obvious patterns or trends.
- Durbin-Watson Test:
- This is one of the most commonly used tests to detect the autocorrelation of residuals, particularly first-order autocorrelation. The test produces a value ranging from 0 to 4, where a value around 2 suggests the absence of autocorrelation, values significantly less than 2 indicate positive autocorrelation, and values significantly greater than 2 indicate negative autocorrelation.
- Breusch-Godfrey (LM Test):
- This test is more flexible than the Durbin-Watson test and can be used to detect higher-order autocorrelations. It is particularly useful when autocorrelation is suspected not to be limited to the first order.
- Ljung-Box Test:
- The Ljung-Box test is primarily used in time series analysis to test for the absence of autocorrelation at various time lags. It is useful for verifying the independence of residuals in models with time series data.
- ACF (Autocorrelation Function) Plot:
- The autocorrelation function plot can be used to examine autocorrelation at various lags. In a model with independent residuals, autocorrelations should be near zero for all lags.
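Below is a hedged sketch of some of these autocorrelation checks applied to the residuals of an OLS fit; the data are simulated and everything is illustrative.

```python
# A sketch of residual-independence diagnostics: Durbin-Watson, Breusch-Godfrey, Ljung-Box.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey, acorr_ljungbox

rng = np.random.default_rng(3)
x = np.arange(200, dtype=float)
y = 1.0 + 0.5 * x + rng.normal(0, 2, 200)

res = sm.OLS(y, sm.add_constant(x)).fit()

# Durbin-Watson: a value around 2 suggests no first-order autocorrelation
print("Durbin-Watson:", durbin_watson(res.resid))

# Breusch-Godfrey: tests higher-order autocorrelation (here up to lag 3)
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(res, nlags=3)
print("Breusch-Godfrey p-value:", lm_pvalue)

# Ljung-Box on the residuals at several lags (recent statsmodels versions return a DataFrame)
print(acorr_ljungbox(res.resid, lags=[5, 10]))
```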
C) Homoscedasticity: The variance of errors should be constant across all levels of the independent variables. If the variance changes (heteroscedasticity), it can lead to inefficient estimates.


To verify the assumption of homoscedasticity in a linear regression model, several tests and methods can be used. This assumption implies that the variance of errors (residuals) of the model should remain constant across all levels of the independent variable. The most common methods for testing homoscedasticity include:
- Graphical Analysis of Residuals:
- Residual Plot Against Predicted Values: A plot showing residuals (the difference between observed and predicted values) against the predicted values is a fundamental diagnostic tool. In the presence of homoscedasticity, residuals will be randomly distributed around the horizontal zero line without showing patterns of expansion or contraction.
- Breusch-Pagan Test:
- This is one of the most commonly used tests to detect heteroscedasticity. The Breusch-Pagan test checks the null hypothesis of homoscedasticity against the alternative hypothesis of heteroscedasticity. A significant result indicates the presence of heteroscedasticity.
- White Test:
- The White test is another popular method for detecting heteroscedasticity. It is similar to the Breusch-Pagan test but does not require the specification of a particular form of heteroscedasticity, making it more general.
- Goldfeld-Quandt Test:
- This test compares the variances of residuals from two different subsets of the data. If the variances differ significantly, the test suggests the presence of heteroscedasticity.
- Levene's Test:
- Although more commonly used in analysis of variance (ANOVA), Levene's test can be applied to test the homogeneity of the variance of residuals in a regression model.
It's important to remember that no test can definitively confirm homoscedasticity; rather, these tests provide indications of the presence or absence of heteroscedasticity. If one or more of these tests indicate heteroscedasticity, it may be necessary to modify the regression model or use robust statistical methods that are less sensitive to violations of homoscedasticity.
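Here's a quick sketch of three of these tests on simulated data where I deliberately let the error variance grow with x; all names and numbers are illustrative assumptions.

```python
# A sketch of heteroscedasticity tests: Breusch-Pagan, White, Goldfeld-Quandt.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white, het_goldfeldquandt

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 300)
y = 2 + 3 * x + rng.normal(0, 1 + 0.5 * x)   # error variance grows with x -> heteroscedastic

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Breusch-Pagan: H0 = homoscedasticity; a small p-value points to heteroscedasticity
bp_stat, bp_pvalue, _, _ = het_breuschpagan(res.resid, X)
print("Breusch-Pagan p-value:", bp_pvalue)

# White test: same idea, without assuming a specific form of heteroscedasticity
w_stat, w_pvalue, _, _ = het_white(res.resid, X)
print("White p-value:", w_pvalue)

# Goldfeld-Quandt: compares residual variances in two subsets of the data
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X)
print("Goldfeld-Quandt p-value:", gq_pvalue)
```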
D) Normality of Errors: For small samples, the errors must be normally distributed. This assumption is particularly important for hypothesis testing and for constructing confidence intervals. For large samples, thanks to the Central Limit Theorem, this assumption becomes less critical.
To verify the assumption of normality of residuals in a linear regression model, several tests and methods can be used. This assumption is particularly important for small samples and for inferences based on hypothesis tests and confidence intervals. The most common methods for testing the normality of residuals include:
- Graphical Analysis:
- Q-Q (Quantile-Quantile) Plot: A Q-Q plot compares the quantiles of residuals with the quantiles of a normal distribution. If the residuals follow a normal distribution, the points in the Q-Q plot will approximately align along a straight line.
- Histogram of Residuals: A histogram can be used to assess the shape of the distribution of residuals. A normal distribution will appear as a symmetric bell curve.
- Shapiro-Wilk Test:
- This is one of the most powerful tests for normality, especially for small sample sizes. The test compares the data to a normal distribution, and a significant result indicates a deviation from normality.
- Kolmogorov-Smirnov Test:
- This test compares the cumulative distribution of the data with that of a normal distribution; the Lilliefors version is used when the mean and variance of the normal distribution are estimated from the data, as is the case for residuals. Again, a significant result suggests that the data do not follow a normal distribution.
- Anderson-Darling Test:
- Similar to the Kolmogorov-Smirnov test, the Anderson-Darling test gives more weight to the tails of the distribution of residuals, making it more sensitive to deviations from normality in the tails.
- D'Agostino-Pearson Test:
- This test combines skewness (asymmetry) and kurtosis (peakedness) to assess normality. A significant result indicates that the residuals are not normally distributed.
It's important to remember that no test can definitively confirm normality. Additionally, for large samples, the violation of normality of residuals becomes less critical due to the Central Limit Theorem. However, if tests suggest a significant deviation from normality, it may be necessary to consider data transformations or alternative models.
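As usual, a small hedged sketch: the same normality checks applied to the residuals of a simulated OLS fit (in practice you would plug in your own model's residuals).

```python
# A sketch of normality diagnostics on OLS residuals: Q-Q plot plus three formal tests.
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 150)
y = 1 + 2 * x + rng.normal(0, 1, 150)

res = sm.OLS(y, sm.add_constant(x)).fit()
residuals = res.resid

# Q-Q plot: points close to the reference line suggest approximately normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

print("Shapiro-Wilk:", stats.shapiro(residuals))
print("D'Agostino-Pearson:", stats.normaltest(residuals))
print("Anderson-Darling:", stats.anderson(residuals, dist="norm"))
```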
Great. At this point, we have explained how a scientist can use a linear regression model to describe, interpret, and make predictions about the relationship between a variable x and a variable y. However, nature is often very complex, and to accurately predict the value of a certain variable y, it's not enough to consider just one type of regressor (variable x) and explain everything else through the residual value ε.
For example, it's not feasible to accurately predict a person's weight by considering only the relationship between height and weight. There are many factors that contribute to an individual's weight, so in such cases a more complex model like multiple linear regression is needed. The basic concept is the same as in simple linear regression, but multiple regressors, i.e., different types of x, are considered. See the image below for a better understanding.


Just two clarifications are needed:
1) When analyzing a multiple linear regression model, it is necessary to evaluate:
- The sign of each regression coefficient (βn). This determines whether the additive effect of the corresponding regressor increases or decreases the value of y. It answers the question: If this regressor increases by one unit, does y increase or decrease on average?

- The absolute value of each regression coefficient (βn). This quantifies the magnitude of the corresponding regressor's effect on the value of y.
- The p-value associated with the hypothesis test of each regression coefficient (βn). This tells us whether the effect of the corresponding regressor on the value of y is statistically significant.
2) Graphical representation of a multiple linear regression model is only possible with at most two regressors, which together with y already occupy three dimensions. Unfortunately, our brains are limited and cannot visualize a reality beyond three dimensions.

It's very important to clarify that in a multiple linear regression model, each regressor is examined individually, and each regression coefficient represents the marginal effect of the corresponding regressor on y while keeping the other regressors constant!!!!
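To make this concrete, here's a hedged sketch of a multiple linear regression fit. The regressors (height_cm, sport_hours, cigarettes_per_day) and the coefficients used to simulate the data are illustrative assumptions echoing the example above, not real estimates.

```python
# A sketch of multiple linear regression: sign, magnitude and p-value of each beta.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({
    "height_cm": rng.normal(172, 9, n),
    "sport_hours": rng.uniform(0, 10, n),
    "cigarettes_per_day": rng.integers(0, 20, n),
})
df["weight_kg"] = (-75 + 0.85 * df["height_cm"]
                   - 0.8 * df["sport_hours"]
                   + 0.2 * df["cigarettes_per_day"]
                   + rng.normal(0, 5, n))

model = smf.ols("weight_kg ~ height_cm + sport_hours + cigarettes_per_day", data=df).fit()

# Each beta is the marginal effect of its regressor, holding the others constant
print(model.params)     # sign and magnitude of each coefficient
print(model.pvalues)    # statistical significance of each coefficient
```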
Even in the context of multiple linear regression, where the relationship between a dependent variable and two or more independent variables is explained, it's crucial to verify certain key assumptions before proceeding with the analysis and for selecting the best model for predicting a phenomenon. These assumptions are necessary to ensure the reliability and validity of the model's parameter estimates and predictions. Luckily, the same assumptions seen for the simple linear regression model apply to multiple linear regression models, but there are additional ones, namely:
A) Absence of Multicollinearity: Independent variables should not be excessively correlated with each other. High multicollinearity can make it difficult to precisely estimate the effect of each independent variable and can inflate the variance of the parameter estimates.

To assess the assumption of absence of multicollinearity in a multiple linear regression model, various tests and statistical indicators are used. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other, making it difficult to isolate the individual effect of each variable. Here are the most common methods for detecting multicollinearity:
- Variance Inflation Factor (VIF):
- VIF measures how much the variance of an estimated regression coefficient is increased due to multicollinearity. A VIF of 1 indicates no multicollinearity; values above 5 or 10 (depending on the guidelines adopted) suggest strong multicollinearity.
- Tolerance:
- Tolerance is the inverse of VIF and measures the amount of variance of an independent variable not explained by the other independent variables. Values close to zero indicate possible multicollinearity.
- Factor Analysis or Principal Component Analysis:
- These methods reduce the number of independent variables to a smaller set of uncorrelated components, helping to identify and remove multicollinearity.
- Condition Index:
- The condition index is another measure that can be used to assess the presence of multicollinearity. High values of the condition index (above 30, for example) can indicate multicollinearity issues.
- Examination of Correlations Among Independent Variables:
- A preliminary analysis of the correlations among independent variables can provide an indication of the presence of multicollinearity. Very high correlations between two or more independent variables suggest the possibility of multicollinearity.
- Examination of the Significance of Coefficients:
- In the presence of multicollinearity, you might find that despite an overall significant model (e.g., a high R-squared), individual coefficients may not be statistically significant.
When multicollinearity is detected, various strategies can be adopted, such as removing one of the correlated independent variables, combining correlated variables, or using regularized regression methods like ridge or lasso regression, which are designed to handle multicollinearity.
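A small sketch of the most common of these checks, the VIF, on simulated data where two regressors are deliberately almost collinear (all names are illustrative):

```python
# A sketch of a multicollinearity check with the Variance Inflation Factor (VIF).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + rng.normal(0, 0.1, n)    # x2 is almost a copy of x1 -> multicollinearity
x3 = rng.normal(0, 1, n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# One VIF per regressor (skipping the constant); values well above 5-10 are a warning sign
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```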
B) Correctness of Functional Form: The model must be correctly specified, including all relevant variables and excluding any irrelevant variables. Omission of important variables or inclusion of unnecessary ones can lead to distorted estimates. To select important variables, obviously, a good understanding of the research context is necessary.
C) Absence of Influential Points or Outliers: The model should not be overly influenced by anomalous or influential data points. These points can have a disproportionate impact on the estimates and results of the regression.
To assess the assumption of absence of influential points or outliers in a regression model, it's necessary to identify and analyze any data that exert an undue influence on the regression model. These points can significantly distort the parameter estimates and predictions of the model. Here are some common methods and measures used to identify these points:
- Leverage:
- Leverage identifies data points that are extreme with respect to the independent variables. Points with high leverage have a great potential to influence the regression line. Graphically, they can be identified through a leverage plot.
- Cook's Distance:
- Cook's distance measures the influence of an observation on the model's predicted values. A high value of Cook's distance indicates that removing that particular observation significantly changes the model's results. Typically, a value greater than 1 is considered indicative of high influence.
- DFFITS:
- DFFITS is a statistical measure that quantifies the influence of an observation on the predicted values. It provides an indication of how much the estimates would change if the observation were excluded from the model. Here too, high values indicate high influence.
- DFBETAS:
- DFBETAS measures the influence of an observation on each regression coefficient. High values of DFBETAS for a given observation indicate that removing that observation would significantly change the corresponding regression coefficient.
- Graphical Analysis:
- Residual Plots: A residual plot can help identify anomalies in the data, such as points with exceptionally high or low residuals.
- Influence Plots: Plots that combine various measures of influence, like leverage and Cook's distance, can be used to identify problematic observations.
- Outlier Test:
- Specific tests like Grubbs' test can be used to identify outliers in the data.
It's important to note that identifying influential points or outliers does not automatically imply that they should be removed from the analysis. Each case should be examined in the context of the research problem. In some cases, these points may represent measurement or recording errors and can be removed. In other cases, they may be valid observations and provide important information about the phenomenon under study. The decision on how to handle these influential points should be guided both by statistical analysis and by knowledge of the application domain.
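Here's a rough sketch of how these influence measures can be extracted from a fitted model with statsmodels; the data are simulated and one artificial influential point is injected on purpose.

```python
# A sketch of influence diagnostics: leverage, Cook's distance, DFFITS, DFBETAS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(0, 1, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)
x[0], y[0] = 6.0, -10.0                        # an artificial influential point

res = sm.OLS(y, sm.add_constant(x)).fit()
influence = res.get_influence()

leverage = influence.hat_matrix_diag           # leverage of each observation
cooks_d, cooks_p = influence.cooks_distance    # Cook's distance (and p-values)
dffits, dffits_threshold = influence.dffits    # DFFITS and a rule-of-thumb threshold
dfbetas = influence.dfbetas                    # one column per regression coefficient

print("Max Cook's distance:", cooks_d.max())
print("Observations with Cook's D > 1:", np.where(cooks_d > 1)[0])
```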
In general, verifying these assumptions is crucial for regression analysis. Diagnostic methods, such as residual analysis, statistical tests, and scatter plots, can be used to assess whether these assumptions are met. If some of these assumptions are not satisfied, modifications to the model or the use of alternative statistical techniques may be necessary.
Especially when it's necessary to select the right regression model for predicting the value of a variable y, it's crucial to find a balance between the potential error we might make and the complexity of the constructed regression model. In this context, we talk about the 'Bias-Variance Tradeoff' to define the equilibrium point that a model must reach to achieve the best predictive performance. If bias is reduced, variance generally increases, and vice versa. The goal is to find the right compromise between a model that is too simple and doesn't learn enough from the data (high bias) and a model that is too complex and learns too much from the training data, including its errors and noise (high variance).
The Bias-Variance Tradeoff helps determine the optimal complexity of the model. For example, a simple linear model might not adequately capture the complexity of the data (high bias), while a high-degree polynomial model might overfit the training data (high variance). Choosing the ideal model thus depends on finding a balance between these two extremes, taking into account the quantity and quality of available data, as well as the nature of the specific problem.
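A hedged way to 'see' the tradeoff is to fit polynomial models of increasing degree to the same simulated data and compare training and test errors; the degrees, sample size, and underlying curve below are my own illustrative choices.

```python
# A sketch of the bias-variance tradeoff: train vs test MSE as model complexity grows.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, 120).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 120)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in [1, 3, 10, 15]:
    poly = PolynomialFeatures(degree=degree)
    X_train = poly.fit_transform(x_train)
    X_test = poly.transform(x_test)

    model = LinearRegression().fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    # Low degree -> high bias (both errors high); very high degree -> high variance
    # (train error keeps dropping while test error climbs back up)
    print(f"degree={degree:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```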

There's more!!!! When we want to make predictions about a certain variable y, we often need to take into account multiple regressors.
Sounds simple, right? Actually, no. Sometimes 'strange' effects come into play that influence the value of y.
Imagine trying to predict an individual's weight using a multiple linear regression model that considers regressors such as hours of sports activity per week, height, number of cigarettes smoked daily, and number of glasses of alcohol consumed in a day. A doctor might tell you that the model is incorrect because it does not take into account some interactions between the regressors used. For example, an interaction effect between the regressor 'number of daily smoked cigarettes' and the regressor 'number of glasses of alcohol consumed in a day' is hypothesized in the literature. In such a case, it's necessary to add a new specific regressor called an 'interaction regressor'. See the image below for a better understanding.
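In code, an interaction regressor is easy to add with statsmodels' formula interface: "cigs * alcohol" expands to cigs + alcohol + cigs:alcohol (the interaction term). The variable names and simulated effects below are purely illustrative assumptions.

```python
# A sketch of a multiple regression with an interaction regressor (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 250
df = pd.DataFrame({
    "height": rng.normal(172, 9, n),
    "cigs": rng.integers(0, 20, n).astype(float),
    "alcohol": rng.integers(0, 6, n).astype(float),
})
df["weight"] = (-70 + 0.8 * df["height"] + 0.1 * df["cigs"] + 0.3 * df["alcohol"]
                + 0.15 * df["cigs"] * df["alcohol"]      # the simulated interaction effect
                + rng.normal(0, 5, n))

model = smf.ols("weight ~ height + cigs * alcohol", data=df).fit()
print(model.params)    # includes a 'cigs:alcohol' coefficient for the interaction
```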

Other special regressors include qualitative regressors and nonlinear regressors.
Qualitative regressors account for the effect of categorical variables on a certain variable y. To be implemented in a mathematical model, the values of qualitative regressors must be encoded as numbers that describe membership in one or another category of the qualitative variable (see image below). When qualitative regressors are encoded as numbers, they are referred to as 'dummy variables'.
See the example and the formula of the model shown below.

!!! When using a simple or multiple linear regression model that includes qualitative regressors, it's important to remember that β0 is no longer just a geometric intercept: it represents the average value that the variable y assumes in the reference category, i.e., when all the considered dummies are equal to 0.
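A short sketch of a dummy-coded qualitative regressor, again on made-up data: with the formula interface, C(sex) builds the dummies automatically, and the intercept becomes the mean of y in the reference category.

```python
# A sketch of a qualitative regressor encoded as a dummy variable (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 200
sex = rng.choice(["F", "M"], size=n)
weight = np.where(sex == "M", 78, 65) + rng.normal(0, 6, n)
df = pd.DataFrame({"sex": sex, "weight": weight})

model = smf.ols("weight ~ C(sex)", data=df).fit()

# Intercept = average weight in the reference category (sex == "F", dummy = 0);
# the C(sex)[T.M] coefficient = average difference of the "M" category from it.
print(model.params)
```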
IMPORTANT TAKEAWAY MESSAGE
In light of what has been said so far, keep this message in mind:
"To select the best regression model for predicting a certain value of y given one or more regressors, it is necessary to respect the assumptions of the models and evaluate the goodness of the model with metrics such as the coefficient of determination or the mean square error."
Appendix:
The coefficient of determination (often denoted as R^2) and the Mean Square Error (MSE) are both used to evaluate the performance of a regression model, but they serve distinct purposes and interpretations. Although both provide important information about the goodness of fit of a model, they measure different aspects of the prediction error.
- Coefficient of Determination (R^2):
- R^2 is a measure of the amount of variance in the dependent variable that is explained by the regression model. It ranges from 0 to 1, where a higher value indicates that a greater percentage of the total variance is explained by the model.
- A high R^2 suggests that the model provides a good explanation of the data; however, a high value does not necessarily imply that the model is appropriate or that it makes accurate predictions.
- R^2 does not account for the absolute size of the prediction errors, but only for their relative variance compared to the total variance of the data.
- Mean Square Error (MSE):
- MSE is the average of the squares of errors, that is, the average of the squared differences between observed values and those predicted by the model.
- Unlike R^2, MSE provides an absolute measure of prediction error, giving an idea of the magnitude of errors in absolute terms.
- A lower MSE indicates better performance, as it suggests that prediction errors are smaller on average.
Purpose and Use:
- While R^2 is useful for understanding how well the model "explains" the variance of the data, it does not provide direct information on the accuracy of individual predictions.
- MSE, on the other hand, provides a direct estimate of how large the prediction errors are, on average, making it very useful for comparing different models or for optimizing a model on training data.
In conclusion, R^2 and MSE have related but not identical purposes in evaluating regression models. While R^2 offers a relative measure of model fit, MSE provides an absolute measure of prediction error. Both are important for a comprehensive assessment of a model's performance.
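For completeness, a tiny sketch computing both metrics on the same predictions (simulated data; the point is only how the two numbers are obtained and read):

```python
# A sketch comparing R^2 (relative fit) and MSE (absolute error) on the same model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(12)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 5 + 2 * x.ravel() + rng.normal(0, 3, 200)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

print("R^2:", r2_score(y, y_pred))            # share of the variance of y explained
print("MSE:", mean_squared_error(y, y_pred))  # average squared prediction error
```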
At this point, you might be wondering what happens if some of the assumptions related to linear regression models start to fail.
For example, when residuals are not normally distributed or when studying NON-linear relationships, more robust models such as Generalized Linear Models (GLMs) could be used. GLMs are a class of statistical models that generalize the classic linear regression model to allow responses (dependent variables) with error distributions other than normal and to establish a non-linear relationship between the dependent variable and independent variables.
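As a hedged taste of what a GLM looks like in practice, here is a Poisson regression for a count response fitted with statsmodels; the data and the log-link relationship are simulated for illustration only.

```python
# A sketch of a Generalized Linear Model: Poisson regression for count data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
x = rng.uniform(0, 2, 300)
y = rng.poisson(np.exp(0.5 + 1.2 * x))   # counts whose mean depends on x through a log link

X = sm.add_constant(x)
glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(glm.summary())
```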
In this regard, I'd like to discuss a concept that I found very complex to understand: why is linear regression called "linear"?
Linear regression is called "linear" because it models the relationship between a dependent variable and one or more independent variables using a linear equation. A linear equation is an algebraic equation in which each term is either a constant or the product of a constant and a single variable. Linear equations can be represented graphically as straight lines.
Here's a more detailed explanation:
- Basic Form: The simplest form of a linear equation with one variable is:
ax + b = 0
- Here, x is the variable.
- a and b are constants (where a is not zero).
- Two Variables: Linear equations often involve two variables, typically x and y. The general form of such an equation is:
y = mx + c
- y and x are the variables.
- m is the slope of the line, indicating the rate at which y changes with respect to x.
- c is the y-intercept, representing the value of y when x is 0.
- Characteristics:
- Straight Line: When graphed on a coordinate plane, a linear equation forms a straight line.
- Constant Slope: The slope m is constant, meaning the rate of change of y with respect to x is the same throughout the line.
- No Exponents or Multiplications Between Variables: Linear equations do not include exponents (other than 1), square roots, or products of variables (like x*y).
- Multiple Variables: Linear equations can be extended to more than two variables, for instance:
a1*x1 + a2*x2 + ... + an*xn = b
- Each term ai*xi is the product of a constant ai and a single variable xi.
- Such equations represent hyperplanes in multidimensional spaces.
If you want to know more about linear equations, please read this article.
In any case, in the context of linear regression, a linear equation models the relationship between the dependent and independent variables. If the equation remains linear (which means that it describes a straight line in two dimensions, a plane in three dimensions, or a hyperplane in higher dimensions), regardless of the number of variables involved, it's considered a linear equation!
On the contrary, non-linear regression models are called "non-linear" because, unlike linear regression models, they do not assume a linear relationship between the independent and dependent variables. In non-linear regression, the relationship is modeled by a non-linear equation, which means that the graph of this relationship is not a straight line. Look at the example below, where you can see the difference in how a logarithmic regression model interpolates the data versus a linear regression model when the data appear to follow a logarithmic, and therefore non-linear, relationship.

Here are some key aspects about NON-linear regression models:
- Non-Linear Equations: In non-linear regression, the relationship between variables is represented by a non-linear equation. These equations include terms that are non-linear in the variables (and, in truly non-linear models, in the parameters as well), such as:
- Exponents (e.g., y = a*x^2 + b*x + c) ---> keep this in mind for later @KeepAttention
- Logarithms (e.g., y = a*log(x) + b)
- Trigonometric functions (e.g., y = a*sin(x) + b)
- Products of variables (e.g., y = a*x*z + b)
- Curved Line Graphs: When plotted on a graph, non-linear equations typically produce curves rather than straight lines. These curves can take various shapes, such as parabolas, hyperbolas, exponential curves, or sine waves.
- Complex Relationships: Non-linear regression is used when the relationship between variables is more complex and cannot be adequately described by a straight line. It can model phenomena where the effect of the independent variable on the dependent variable changes in magnitude or direction at different levels of the independent variable.
- Fitting the Model: Fitting a non-linear regression model is often more complex than linear regression. It usually requires iterative numerical methods to estimate the parameters that best fit the data, as there are no straightforward formulas like in linear regression.
In summary, non-linear regression models are termed "non-linear" because they represent relationships that cannot be adequately described by a straight line, requiring more complex equations to capture the curvature or varying rates of change observed in the data.
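To make this tangible, here's a sketch where a y = a*log(x) + b relationship is fitted with scipy's generic curve_fit (the iterative route mentioned above) and compared with a plain straight line. All numbers are simulated, and this particular model could in fact also be linearized by regressing y on log(x).

```python
# A sketch of fitting a logarithmic relationship vs a straight line (simulated data).
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

rng = np.random.default_rng(14)
x = rng.uniform(1, 50, 200)
y = 3.0 * np.log(x) + 1.0 + rng.normal(0, 0.4, 200)

def log_model(x, a, b):
    return a * np.log(x) + b

params, _ = curve_fit(log_model, x, y)     # iterative estimation of a and b
lin = stats.linregress(x, y)               # straight-line fit for comparison

resid_log = y - log_model(x, *params)
resid_lin = y - (lin.intercept + lin.slope * x)
print("Sum of squared errors (logarithmic model):", np.sum(resid_log ** 2))
print("Sum of squared errors (straight line):    ", np.sum(resid_lin ** 2))
```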
Finally, to make everything more complex, there are exceptions such as the polynomial regression model!!!!!
Polynomial regression, despite involving polynomials (which are inherently non-linear functions of x (@KeepAttention)), is actually considered a form of linear regression in the context of statistical modeling. This might seem counterintuitive, so let's clarify why this is the case:
- Linear in Parameters: The key reason polynomial regression is classified as a linear model is that it is linear in terms of its parameters (coefficients), not necessarily in terms of the variables. For instance, a quadratic polynomial regression model can be expressed as:
y = β0 + β1*x + β2*x^2 + ε
Here, x^2 is a non-linear term in the variable x, but the model is linear in the parameters β0, β1, and β2. This linearity in parameters means that we can estimate these coefficients using linear regression techniques (see the short sketch after this list).
- Graphical Representation: Graphically, polynomial regression models produce curves, not straight lines. For example, a quadratic model (involving x^2) produces a parabolic curve. This is a clear indication of the non-linear relationship between the independent and dependent variables.
- Flexibility in Modeling Non-Linear Relationships: Polynomial regression allows linear models to fit more complex, non-linear data by introducing higher-degree terms of the independent variables. This makes it a powerful tool for modeling non-linear phenomena while retaining the computational simplicity of linear regression.
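Here is the promised sketch: a quadratic model fitted with ordinary least squares. The curve is a parabola, yet the estimation is plain linear algebra because the model is linear in the coefficients (data and degree are illustrative assumptions).

```python
# A sketch of polynomial regression: non-linear in x, linear in the coefficients.
import numpy as np

rng = np.random.default_rng(15)
x = rng.uniform(-3, 3, 150)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 1, 150)

# Design matrix [1, x, x^2]; solving the ordinary least-squares problem gives the betas
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("beta0, beta1, beta2:", beta)
```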
In summary, polynomial regression is considered a linear model because it is linear in the coefficients, even though it models non-linear relationships in terms of the variables. This distinction is important in understanding how different types of regression models are classified and applied. However, I won't discuss this, at least for the moment, as it would lead us into the world of machine learning. In fact, we have reached the end of our journey through the basics of statistics where we first made a brief introduction to this fascinating world (link1, link2), then talked about descriptive statistics (part1, link1), probability theory (part1, part2), and inferential statistics (part1, part2, part3).
What lies ahead?
I'm brimming with enthusiasm (and coffee) as I write these lines for you. What an incredible journey we have embarked on together! 🌟 Remember, I'm not here to play the professor, but rather to share with you every little discovery I make on my path through the magical world of learning and bioinformatics.
🎉 A Huge Thank You for traveling with me this far, but now, hold tight, because we are about to dive into the fascinating universe of machine learning! Imagine yourselves as little digital explorers, ready to decipher the secrets of machines that learn. 🌐
I hope this new chapter will be enlightening for both me and you. Together, we'll uncover how these wonderful machines think, act, and who knows, maybe even dream!
🎄 But now, take some rest. The holidays are just around the corner, so here's a warm wish for a Merry Christmas and a fantastic New Year! May it be filled with learning, smiles, and bug-free code (or at least with few)!
See you next time, dear readers! The future of machine learning awaits us! 🌟🎅
P.S. As a Christmas gift, I offer you these very simplified but summary diagrams of our journey through the fantastic world of basic statistics.




Omar Almolla