Hello. Finally, I’m back to writing on the blog. Intense work commitments and plenty of unexpected events have limited the time I could spend studying and sharing with you my progress in the world of bioinformatics and, more generally, in data analysis. But today, despite the huge backlog of work I still have, I really need to stop for a moment and talk to you about some notions that, bit by bit, I’ve had the chance to learn. After all, writing on the blog is the only way I have to review what I’m learning, hoping not to forget it all instantly (fun fact: I have a terrible memory 😅). And besides, writing is fun and relaxing for me, so it’s almost therapeutic.

Recently, I’ve had the good fortune to work more on machine learning projects, and I’m realizing that, while bioinformatics remains my main focus, I’m becoming increasingly passionate about studying the various ML algorithms. And since that’s how I am, when something fascinates me, I tend to mentally arrange it into categories (a bit arbitrary, I admit). So I created my own mental divisions, which may not be perfect but help me see things more clearly. And if I’m wrong? Well, you’re more than welcome to correct me in the comments.

To begin with, I think it’s useful to distinguish between Artificial Intelligence, Machine Learning, and Deep Learning: roughly speaking, Deep Learning is a subset of Machine Learning, which in turn is a subset of the broader field of Artificial Intelligence.


Then, another fundamental distinction is between the different learning approaches that a machine (well, a model) can follow. And here, simplifying a lot, I’d say there are at least three:

  • Supervised learning: here we already have known labels (for example “sick” vs “healthy”), and the model learns to predict a target. If the target is numeric, we talk about regression; if it’s categorical, we talk about classification.

  • Unsupervised learning: here instead there are no predefined labels, and the goal is to discover hidden patterns in the data. Think of clustering: the algorithm tries to group observations by similarity. It doesn’t predict known labels, but it “invents” groups and tells us which group each observation belongs to.

  • Reinforcement learning: this is the one I personally find most fascinating, because it’s based on the “trial and error” mechanism: the agent interacts with an environment, receives rewards or penalties, and gradually learns to maximize the reward.
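
Just to make the “labels vs. no labels” difference concrete, here’s a tiny sketch with scikit-learn (the data are invented on the spot, so take it as an illustration, not a recipe). Reinforcement learning is left out because it needs an environment to interact with, which doesn’t fit in a few lines.

```python
# Toy sketch (invented data): supervised models learn from (X, y),
# unsupervised models only ever see X.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # 100 observations, 4 features
y = (X[:, 0] > 0).astype(int)           # known labels, e.g. "sick" vs "healthy"

# Supervised: the model is shown the right answers during training
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))               # predicts the known classes

# Unsupervised: no labels at all, the model "invents" groups by similarity
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                   # cluster assignments, not known classes
```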


What I’ve noticed as I’ve gone deeper into these worlds is that, despite the variety of approaches and models, there exists a common workflow. Whether you’re playing with a linear regression or a deep neural network with a thousand parameters, the basic steps are surprisingly similar.

And so, let me tell you how I see this workflow:

1. Defining the task

The first thing is always to clarify what we want the model to do. After all, our ultimate goal in machine learning is to create a model that can predict something: a number, a class, a group.

  • If we want a value → regression.
  • If we want a class → classification.
  • If we want to discover hidden groups → clustering.

2. Checking assumptions

Every model has its rules of the game. It’s not like we can just throw the data in and—magic!—it works. There are conditions that need to be respected.

Classic example: linear regression. To correctly estimate parameters and do inference, we need to assume linearity, independence of errors, homoscedasticity, and normality of residuals. Note: these assumptions are crucial if we want to trust the estimates and statistical tests, but if the goal is purely predictive, violating some of them (like normality) is not such a big deal.
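
To give you an idea of what “checking assumptions” can look like in code, here’s a hedged sketch with statsmodels and scipy (X and y are simulated just for the example): fit the regression, then poke at the residuals.

```python
# Rough sketch of assumption checks for linear regression
# (X and y are simulated here; in real life they're your data).
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
residuals = model.resid

# Normality of residuals (matters for inference more than for pure prediction)
print("Shapiro-Wilk:", stats.shapiro(residuals))

# Homoscedasticity: Breusch-Pagan test
print("Breusch-Pagan:", het_breuschpagan(residuals, X_const))

# Independence of errors: Durbin-Watson statistic (close to 2 = little autocorrelation)
print("Durbin-Watson:", durbin_watson(residuals))
```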


3. Data preprocessing

Here begins the “crafting” part, almost mathematical DIY, because raw data are rarely immediately “appetizing” for a model.

  • Feature encoding: turning categories into numbers.
  • Feature transformation: scaling, normalization, standardization.
  • Handling missing data: dropping, imputing, etc.
  • Data dimensionality reduction: PCA, t-SNE, UMAP, TDA, etc.

A jungle of terms, but the idea is simple: make the data more digestible for the model.
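
To make this less abstract, here’s a rough sketch of a preprocessing pipeline with scikit-learn. The column names (“age”, “tissue”, and so on) are completely made up; the point is just to show how encoding, imputation, scaling, and dimensionality reduction can be chained together.

```python
# Hedged preprocessing sketch with scikit-learn: the column names are made up,
# the structure is the point (impute, encode, scale, then reduce dimensions).
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

numeric_cols = ["age", "expression_level"]       # hypothetical columns
categorical_cols = ["tissue", "condition"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),        # handling missing data
    ("scale", StandardScaler()),                         # standardization
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # feature encoding
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Preprocessing + dimensionality reduction (PCA here; t-SNE/UMAP are usually
# kept for visualization rather than put inside a modeling pipeline).
prep_and_reduce = Pipeline([("prep", preprocess), ("pca", PCA(n_components=5))])
# X_ready = prep_and_reduce.fit_transform(df)   # df: your raw DataFrame
```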


4. Training and optimizing the model

Now we come to a very complex part of our workflow. Despite its complexity, the principle is actually quite simple, and it’s that famous saying: “you learn by making mistakes.”

The model trains to predict the target based on the data available, and during this process it “adjusts its aim” little by little to reduce its error.

So it’s logical to ask: how do we measure a model’s error? And how do we reduce it?

Every model has its own cost function (or loss function), a mathematical function that measures predictive error. The best model, by definition, is the one that achieves the lowest value of this function. To minimize the cost function, we use algorithms called optimizers, which adjust the model’s parameters (often called “weights”). Important note: not all models update weights through iterative algorithms; for example, classical linear regression has a direct analytical solution. But for many other models, from logistic regression and SVMs up to neural networks, numerical optimization is essential.
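
To show the mechanics of “cost function + optimizer” without any library magic, here’s a toy example in plain NumPy: mean squared error as the loss, gradient descent as the optimizer, and a single weight being nudged step by step. Real frameworks do all of this for you; this is just to watch the gears turn.

```python
# Toy example: fit y ≈ w * x by minimizing the mean squared error with gradient descent.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # the "true" weight is 3.0

w = 0.0                 # initial guess for the weight
lr = 0.1                # learning rate: how big each correction step is

for step in range(100):
    y_pred = w * x
    loss = np.mean((y_pred - y) ** 2)            # cost function (MSE)
    grad = np.mean(2 * (y_pred - y) * x)         # derivative of the loss w.r.t. w
    w -= lr * grad                               # optimizer step: move against the gradient

print(w)   # should end up close to 3.0
```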

Here’s an analogy I like: imagine an athlete trying to perform a perfect backflip. At first, they fail, they fall, but thanks to their coach (the optimizer) they get constant feedback on how to move their body. And so, by making mistakes and correcting them, the athlete improves more and more until they reach their goal.

Said like that, it seems easy, but in practice things get complicated. Often the model makes very precise predictions on the training dataset but fails miserably on new data. This is overfitting: the model learned the training set’s details and noise too well and thus generalizes poorly. More technically: low average difference between prediction and true value (low bias), but high variability in predictions when the dataset changes (high variance).

The opposite is underfitting: the model is too simple and can’t even fit the training data well. Here we see a high average difference (high bias) but low variability (low variance).

In general:

  • The average difference between prediction and true value = the model’s bias.
  • The variability of predictions across training sets = the model’s variance.
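
For the mathematically curious: with a squared loss, these two quantities show up explicitly in the classic decomposition of the expected error, where $f$ is the true function, $\hat{f}$ our trained model, and $\sigma^2$ the noise that no model can remove:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$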

A technical detail (which I’ll leave pending for a future article): variance can be estimated in different ways. In classical frequentist models (OLS, Random Forest) with bootstrap; in Bayesian models analytically as noise + parameter uncertainty; in deep learning through approximations like Monte Carlo dropout or ensembling.

The takeaway: if we don’t balance bias and variance, we end up with overfitting or underfitting. And this is where the famous bias–variance trade-off comes in.

Bias and variance are closely linked to both the model’s error and its complexity. Complexity, roughly speaking, measures how flexibly the model can adapt to the data: how many free parameters (or degrees of freedom) it has available to capture patterns.

Some examples:

  • Polynomial regression → the polynomial degree controls complexity.
  • Decision trees → the number of nodes or depth.
  • Neural networks → the number of layers and neurons per layer.

More complexity → more ability to adapt… but also more risk of overfitting. Less complexity → more risk of underfitting. The trick is to find the right compromise, that balancing point we all know from the classic bias–variance trade-off graph.

Another interesting point: models that overfit are often too complex. They are excellent at memorizing training data but terrible at generalizing to unseen data. This leads to a key concept: to truly evaluate a model, we always need a testing phase on unseen data.

Two classic strategies:

  • Hold-Out: split data into two groups, typically 70% for training and 30% for testing. By evaluating performance on both, with metrics and plots (learning curve, accuracy curve, etc.), we can spot overfitting or underfitting.
  • Cross-Validation: here you don’t always need a preliminary hold-out (though some do it). In general, we split the dataset into K portions. In each iteration, we train on K-1 portions and validate on the remaining one. Thus we get K models and average their performances. Finally, a separate test set can be used for the final evaluation.
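
Here’s a hedged sketch of both strategies with scikit-learn (on a synthetic dataset, so the numbers themselves mean nothing; it’s the structure that matters):

```python
# Hold-out and K-fold cross-validation with scikit-learn (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold-out: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# K-fold cross-validation: K models, one score per fold, then the average
scores = cross_val_score(RandomForestClassifier(random_state=0), X_train, y_train, cv=5)
print("CV accuracy:", scores.mean(), "+/-", scores.std())
```

Here the cross-validation is run on the training portion only, so the 30% hold-out stays untouched for the final evaluation.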


5. Understanding model predictions

And here we come to one of the points that fascinates me most about machine learning: once we have a model that works well, that predicts correctly, the inevitable question arises: how the heck does it do it?

Yes, we could just use the model as a black box: put data in → prediction out → everyone’s happy. But if we stop for a second, curiosity kicks in: what did the model rely on to reach that result?

For simple cases like a linear regression with few variables, the story is straightforward: the coefficients tell us immediately which variable weighs more or less. But with more complex models—deep neural networks, decision trees nested in forests, gradient boosting with hundreds of splits—things get much less readable. That’s where we face the so-called black box problem.

Now, it’s not that nothing clear is happening inside: mathematically, the model works, it processes input and produces output. It’s just that, for us humans, reconstructing why a prediction was made is almost impossible at a glance.

And this is where an increasingly important field comes in: Explainable AI (XAI), the set of techniques that aim to at least partly open that black box.

The approaches are diverse, but to give you an idea, we can divide them into two families:

  • Global methods: help us understand, in general, how the model uses variables. For example, feature importance tells us which variables are most influential on average, while partial dependence plots show how predictions change as we vary one variable while holding the others fixed.

  • Local methods: focus instead on a single prediction, i.e., they try to explain why the model made that decision for that specific data point. Here come tools like LIME and especially SHAP.

SHAP in particular is very popular because, based on game theory, it assigns each feature a kind of “contribution” to the final prediction. So, in a classification problem, we might say: “The model classified this patient as high-risk because blood pressure contributed +0.3, cholesterol +0.2, while age only added +0.05.” A sort of detailed report of the decision.
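
Just to show what this looks like in practice, here’s a hedged SHAP sketch for a tree-based classifier (toy data again; with real data you’d pass a DataFrame with named features so the plots are readable):

```python
# Hedged SHAP sketch for a tree-based classifier (synthetic data for illustration).
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # fast explainer dedicated to tree models
shap_values = explainer.shap_values(X)    # one contribution per feature, per sample

# Local view: the contributions behind a single prediction (the "detailed report")
print(shap_values[0])

# Global view: which features matter most on average across the whole dataset
shap.summary_plot(shap_values, X)
```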

So explainability is not just an intellectual whim, but a practical necessity:

  • In medical or legal contexts, it’s needed to trust predictions (it’s not enough for them to be correct; they must also be explainable).
  • In research, it helps discover hidden patterns (if a feature is always important, maybe there’s an underlying biological mechanism).
  • And in general, it allows us to better validate the model.

In short, XAI is a kind of bridge between the mathematics of models and human understanding. And trust me, sooner or later you’ll have a model that works great, but a reviewer or collaborator who’ll ask: “Okay, nice, but how does it do it?” And that’s when XAI will be your best ally.


6. Model deployment

And finally, the last step of our journey: deployment.

Because training a model is fun, experimenting is exciting, parameter tuning is stimulating… but in the end, if we want the model to have a real impact, we need to put it into production.

What does that mean? It means making it available outside our notebook or local script, so that others can use it easily, maybe even without knowing anything about machine learning.

The process usually goes like this:

  1. First, we test the model once more on completely new data (perhaps collected later). This is the last check to make sure the model isn’t just good “by chance” but can truly generalize.
  2. Then, if the model passes the test, we can make it available. Options include:

    • Local deployment: the model runs on our computer or a dedicated server, and we use it via scripts or local interfaces.
    • Web/API deployment: we create a service accessible to others via an API, for example using Flask, FastAPI, or Django. This way anyone can send data and receive predictions (a minimal sketch follows this list).
    • Integration into pipelines: the model becomes part of a larger workflow, for example an automated bioinformatics analysis.
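
Picking up the Web/API option, here’s a minimal, hedged sketch with FastAPI. The model file (model.pkl) and the feature names are all hypothetical; the point is just the shape of a prediction endpoint.

```python
# Minimal prediction API sketch with FastAPI (model.pkl and features are hypothetical).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")          # a previously trained scikit-learn model

class Sample(BaseModel):
    age: float
    blood_pressure: float
    cholesterol: float

@app.post("/predict")
def predict(sample: Sample):
    features = [[sample.age, sample.blood_pressure, sample.cholesterol]]
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}
```

Assuming the file is called app.py, you can start it with `uvicorn app:app --reload` and anyone can send JSON to /predict and get a prediction back.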

A key point in deployment is scalability: a model that works on our laptop often needs adaptation to handle much larger volumes of data or many concurrent requests. Here tools like Docker, Kubernetes, or cloud services (AWS, GCP, Azure) come into play.

And let’s not forget another fundamental aspect: monitoring. A model in production isn’t a statue: it’s a dynamic system. Real data can change over time (data drift), distributions shift, and what worked well at the beginning may degrade quickly. That’s why monitoring and periodic updates are essential.
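
As a taste of what monitoring can look like, here’s a hedged sketch of a very simple drift check: compare the distribution of each feature at training time with the data arriving in production, using a Kolmogorov–Smirnov test (the test and the threshold are illustrative choices, not a standard).

```python
# Simple data-drift check: compare each feature's distribution in the training
# data vs. the data seen in production (simulated here with a small shift).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, size=(1000, 3))    # data the model was trained on
X_prod = rng.normal(loc=0.3, size=(1000, 3))     # "new" data arriving in production

for j in range(X_train.shape[1]):
    stat, p_value = ks_2samp(X_train[:, j], X_prod[:, j])
    if p_value < 0.01:
        print(f"feature {j}: distribution shift detected (p={p_value:.1e})")
```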

In other words: deployment is not the end of the story, but the beginning of a new phase, the one where the model becomes truly “useful.” It’s a bit like training an athlete: gym practice is important, but the real test comes when they step into the arena in front of the audience.


Final thoughts

And with that, I think I’ve given you quite a feast of chatter about machine learning today 😅. As always, don’t take this as an “academic lecture,” but rather as my personal notebook open to everyone. I write to review and fix in my little head what I’m learning day by day, and inevitably I might say something imprecise or even wrong.

But you know what? That’s perfectly fine for me. Because you learn by making mistakes (true for models but also for us humans), and if someone in the comments wants to correct me or share their own experience, I’ll only be glad: it means I’ll have learned something more.

In the end, this blog is nothing more than a small space where I share my journey, with my enthusiasm, my doubts, and even my mistakes. And if these lines help someone to clarify ideas, review concepts, or simply have a laugh at my ramblings… well, mission accomplished 😉.

