<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_HTML">

Hello! I finally found some free time to get back to writing on my blog. Have I ever told you how I ended up becoming a bioinformatician (or rather, how I started hoping to become one)?
I graduated in Agricultural Sciences and later earned a Master’s degree in Plant Biotechnology. I took a very introductory bioinformatics course and then "played" at being a bioinformatician during my internship and thesis. I never had any courses in statistics, machine learning, or data analysis during my university studies. I learned everything—and I’m still learning—on my own. I study from books, YouTube, and platforms that offer online courses, such as Physalia and Coursera.

Why am I telling you all this?
Well, partly to create a friendly atmosphere, but mainly to share an anecdote that will allow me to introduce today's topic in a way that is not too academic or heavy.

My Frustration with Bioinformatics at the Beginning

During my thesis period, I could barely type commands in Bash. I used scripts written by my supervisor to run analyses that only he could understand. Sometimes, my only task was to change file paths in the code and press Enter. It was frustrating because I didn’t understand anything that was happening in the terminal. However, I have always been a curious and determined person, so I didn’t let that discourage me—I kept asking why things worked the way they did. But my supervisor had a thousand responsibilities and couldn’t always spend time explaining things to a beginner like me.

There was one concept that kept coming up, and at first, it seemed like a mythical three-headed monster: data dimensionality reduction.
So today, I’d like to talk about this, focusing on one of the most widely used dimensionality reduction techniques: Principal Component Analysis (PCA).

The Curse of Dimensionality

Let’s start with a problem that affects all of us who analyze data, especially in the biological field: The curse of dimensionality.

This term refers to the computational, analytical, clustering, and visualization challenges that arise when dealing with high-dimensional data.

What Do We Mean by "Space"?

In this context, "space" refers to the plane or multidimensional environment where data points are represented graphically.

For example:

  • If we have two variables, we can visualize the data on a 2D plane.

  • If we have three variables, we can imagine a 3D space.

Beyond three dimensions, we can’t visualize it intuitively, but mathematically, it follows the same principles.

What is a Dimension?
To answer this, let's first clarify how data is structured:

Data is typically stored in an N × P matrix, where:
N = the number of observations (e.g., cells, individuals, samples).
P = the number of variables (e.g., gene expression levels, peak accessibility, etc.).

Each variable (P) represents a dimension—that is, an axis in space along which data points can vary.

More dimensions mean more complexity. If we add more variables, the data "spreads out" into additional directions. This makes analysis, clustering, and visualization more difficult.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Create toy matrices
N = 6  # Number of cells (observations)

# Matrix 1: One gene (A)
m1 = np.random.rand(N, 1) * 10  # Expression levels for Gene A

# Matrix 2: Two genes (A, B)
m2 = np.random.rand(N, 2) * 10  # Expression levels for Gene A and Gene B

# Matrix 3: Three genes (A, B, C)
m3 = np.random.rand(N, 3) * 10  # Expression levels for Gene A, Gene B, and Gene C

# Display matrices using pandas
df_m1 = pd.DataFrame(m1, columns=["Gene A"])
df_m2 = pd.DataFrame(m2, columns=["Gene A", "Gene B"])
df_m3 = pd.DataFrame(m3, columns=["Gene A", "Gene B", "Gene C"])

print("\nMatrix 1 (N=6, P=1):")
print(df_m1)
print("\nMatrix 2 (N=6, P=2):")
print(df_m2)
print("\nMatrix 3 (N=6, P=3):")
print(df_m3)

# Plot data
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot Matrix 1 (1D scatter)
axes[0].scatter(np.arange(N), m1[:, 0], color='blue', label="Gene A")
axes[0].set_title("Matrix 1: N = 6, P = 1 (Gene A)")
axes[0].set_xlabel("Cell Index")
axes[0].set_ylabel("Gene Expression")
axes[0].legend()

# Plot Matrix 2 (2D scatter)
axes[1].scatter(m2[:, 0], m2[:, 1], color='red', label="Cells")
axes[1].set_title("Matrix 2: N = 6, P = 2 (Gene A, Gene B)")
axes[1].set_xlabel("Gene A Expression")
axes[1].set_ylabel("Gene B Expression")

# Plot Matrix 3 (3D scatter)
fig.delaxes(axes[2])  # remove the empty 2D axis so the 3D axis can take its place
ax3d = fig.add_subplot(133, projection='3d')
ax3d.scatter(m3[:, 0], m3[:, 1], m3[:, 2], color='green')
ax3d.set_title("Matrix 3: N = 6, P = 3 (Gene A, Gene B, Gene C)")
ax3d.set_xlabel("Gene A Expression")
ax3d.set_ylabel("Gene B Expression")
ax3d.set_zlabel("Gene C Expression")

# Show plots
plt.show()
### OUTPUT OF THE CODE ABOVE

Matrix 1 (N=6, P=1):
     Gene A
0  3.745401
1  9.507143
2  7.319939
3  5.986585
4  1.560186
5  1.559945

Matrix 2 (N=6, P=2):
     Gene A    Gene B
0  0.580836  8.661761
1  6.011150  7.080726
2  0.205845  9.699099
3  8.324426  2.123391
4  1.818250  1.834045
5  3.042422  5.247564

Matrix 3 (N=6, P=3):
     Gene A    Gene B    Gene C
0  4.319450  2.912291  6.118529
1  1.394939  2.921446  3.663618
2  4.560700  7.851760  1.996738
3  5.142344  5.924146  0.464504
4  6.075449  1.705241  0.650516
5  9.488855  9.656320  8.083973

Now look at what happens when we use P = 4.
# Create a new matrix with four genes (A, B, C, D)
m4 = np.random.rand(N, 4) * 10  # Expression levels for Gene A, B, C, and D

# Display the matrix
df_m4 = pd.DataFrame(m4, columns=["Gene A", "Gene B", "Gene C", "Gene D"])
print("\nMatrix 4 (N=6, P=3):")
print(df_m4)

# Plot the 4D matrix (only 3 dimensions can be plotted in 3D)
fig = plt.figure(figsize=(8, 6))
ax4d = fig.add_subplot(111, projection='3d')

# Plot first three genes in 3D (since we can't visualize 4D directly)
sc = ax4d.scatter(m4[:, 0], m4[:, 1], m4[:, 2], c=m4[:, 3], cmap='viridis')

# Add colorbar to represent Gene D
cbar = plt.colorbar(sc, ax=ax4d, shrink=0.5, aspect=5)
cbar.set_label("Gene D Expression")

# Labels and title
ax4d.set_title("Matrix 4: N = 6, P = 4 (Gene A, B, C, D)")
ax4d.set_xlabel("Gene A Expression")
ax4d.set_ylabel("Gene B Expression")
ax4d.set_zlabel("Gene C Expression")

# Show plot
plt.show()

### OUTPUT

Matrix 4 (N=6, P=4):
     Gene A    Gene B    Gene C    Gene D
0  3.109823  3.251833  7.296062  6.375575
1  8.872127  4.722149  1.195942  7.132448
2  7.607850  5.612772  7.709672  4.937956
3  5.227328  4.275410  0.254191  1.078914
4  0.314292  6.364104  3.143560  5.085707
5  9.075665  2.492922  4.103829  7.555511

Oops! If we have more than three genes, we can no longer directly visualize the data.
Our brain is not designed to perceive dimensions beyond three, and using a high number of dimensions leads to mathematical and computational challenges.

Let’s take an example.

Suppose we have a multiple linear regression model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_P x_P + \varepsilon $$

or, in matrix form, \( y = X\beta + \varepsilon \), where \(X\) is the \(N \times P\) matrix of predictors.

If \( P \gg N \), estimating the coefficients \(\beta\) becomes problematic because the system is underdetermined: there are infinitely many vectors \(\beta\) that satisfy the equations, so the solution is not unique. This happens because the matrix \( X^T X \), whose inverse \( (X^T X)^{-1} \) is needed for the ordinary least-squares estimate of \(\beta\), becomes singular, meaning that the regression model cannot be solved uniquely.
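To see this concretely, here is a small NumPy sketch (with made-up random data, not from any real dataset) showing that as soon as P exceeds N the matrix \( X^T X \) does not have full rank and is therefore singular:

import numpy as np

np.random.seed(0)

N, P = 10, 50  # far more variables than observations
X = np.random.randn(N, P)

XtX = X.T @ X  # the P x P matrix used in least squares
print("Shape of X^T X:", XtX.shape)                   # (50, 50)
print("Rank of X^T X:", np.linalg.matrix_rank(XtX))   # at most N = 10, far less than P = 50

# Because its rank is lower than its dimension, X^T X is singular:
# it has no inverse, so the least-squares coefficients cannot be computed uniquely.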

Unfortunately, I am not a mathematician, and while there are fascinating mathematical proofs behind all of this, I won't go into those details. However, I believe I have clarified that having 𝑃 ≫ 𝑁 is a very common phenomenon, especially in biological data analysis. Consider, for example, an RNA-seq dataset, where we typically measure the expression levels of thousands of genes (𝑃, each gene is a variable) in only a few individuals (𝑁, each individual is a statistical observation).

In transcriptomic datasets, it is common to analyze more than 20,000 genes across fewer than 100 samples, creating a classic high-dimensionality problem.

At this point, it should be clear: the curse of dimensionality tells us that having P≫N poses challenges for visualization, analysis, and mathematical operations. To make these tasks feasible, we need to reduce P, bringing our data to a more manageable state where P ≤ N. This is where the concept of "Dimensional Reduction" comes from.

Now, let's explore the dimensional reduction methods we have at our disposal.

Types of Dimensional Reduction Approaches

In general, there are different dimensionality reduction techniques, and different types of data may require different approaches.

We can categorize dimensionality reduction methods into two main groups:

- Feature Selection Methods

In this approach, the less important features—those that explain little of the variability in the data—are removed. This means keeping only the most relevant features, selected based on statistical criteria or machine learning models.

- Feature Extraction Methods

Instead of removing variables, this approach creates new features that condense information from the original ones. Think of it as a way to summarize multiple variables into fewer dimensions while preserving as much information as possible.

Feature extraction methods can be further divided into two subcategories:

- Linear methods: New features are created as linear combinations of the original variables (e.g., Principal Component Analysis (PCA)). These methods aim to maximize variance preservation, but they do not necessarily preserve absolute distances between points in the transformation.

- Non-linear methods: New features are obtained through non-linear transformations of the original variables (e.g., t-SNE, UMAP). These methods do not preserve absolute distances, focusing instead on maintaining local or relative distances between points, meaning that clusters and local structures are preserved, but global relationships may be distorted.

Principal Component Analysis (PCA)

Now that we have laid the foundation for dimensional reduction, let’s dive into one of the most widely used methods: PCA (Principal Component Analysis).

In short, PCA is a dimensional reduction method that uses a linear feature extraction approach, meaning it creates new variables (principal components, or PCs) as linear combinations of the original ones.

The key steps in PCA are:

1) Scaling the data – Standardizing the variables to have mean 0 and variance 1.

2) Computing the covariance matrix – Measuring how the variables relate to each other.

3) Finding eigenvalues and eigenvectors – Extracting the principal components that explain the most variance.

4) Sorting and selecting the principal components – Keeping only the most informative ones.

5) Projecting the data onto the selected PCs – Transforming the original data into the new reduced space.
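Before going through each step in detail, here is a minimal from-scratch sketch of the whole pipeline in NumPy, using made-up data (a real analysis would normally rely on scikit-learn's PCA, as in the worked example at the end of this post):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))  # made-up data: 50 observations, 5 variables

# 1) Scale the data (mean 0, variance 1 for each variable)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) Compute the covariance matrix (5 x 5)
cov = np.cov(X_scaled, rowvar=False)

# 3) Find the eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# 4) Sort the components by decreasing eigenvalue and keep the first k
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2
explained = eigenvalues / eigenvalues.sum()
print("Variance explained by the first", k, "PCs:", explained[:k].sum())

# 5) Project the data onto the first k principal components
X_reduced = X_scaled @ eigenvectors[:, :k]
print("Reduced data shape:", X_reduced.shape)  # (50, 2)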

1) Scaling the Data

PCA is sensitive to the scale of the variables because it is based on the covariance or correlation matrix. If the variables have very different ranges, those with larger values will dominate the variance and distort the results.

To avoid this issue, it is essential to standardize the variables, ensuring that each has a mean of 0 and a standard deviation of 1 (Z-score normalization). This makes variables comparable, preventing those with larger scales from disproportionately influencing the analysis.
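As a quick sketch (the values are invented), standardization can be done by hand with NumPy or with scikit-learn's StandardScaler, and the two give identical results:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 2 variables on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0,  800.0],
              [4.0, 2000.0]])

# Manual z-score: subtract the mean and divide by the standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same thing with scikit-learn
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_manual.mean(axis=0))             # ~0 for each column
print(X_manual.std(axis=0))              # 1 for each column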

2) Computing the Covariance Matrix

Now, we need to analyze the relationships between variables to understand how they vary together. To do this, we can compute a covariance matrix, which describes the linear dependencies between variables.

The covariance matrix is a symmetric matrix where each element represents the covariance between two variables. The diagonal elements indicate the variance of each individual variable, while off-diagonal elements describe how two variables co-vary.
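For example (again with made-up numbers), the covariance matrix can be computed with np.cov; notice that it is symmetric and that its diagonal contains the variance of each variable:

import numpy as np

rng = np.random.default_rng(1)
X_scaled = rng.normal(size=(30, 3))   # pretend these are 3 standardized variables

cov = np.cov(X_scaled, rowvar=False)  # 3 x 3 covariance matrix
print(cov)
print("Symmetric:", np.allclose(cov, cov.T))
print("Diagonal (variances):", np.diag(cov))
print("Variances computed directly:", X_scaled.var(axis=0, ddof=1))  # matches the diagonal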

3) Finding Eigenvalues and Eigenvectors

How are the eigenvectors and eigenvalues calculated?


Why Do We Need to Compute Eigenvectors and Eigenvalues?
These concepts are at the core of PCA, as they define the principal components, which allow us to reduce the dataset's dimensionality.

Some of you may have noticed that we have introduced a key concept: Principal Components (PCs).

Principal components are the true essence of PCA, because they are what actually enables dimensionality reduction. With Principal Component Analysis we can "concentrate" the original features into a few new variables while preserving most of the original information.
A principal component is a new variable created as a linear combination of the original variables, with weights determined by the eigenvectors: the weights are the values of the eigenvector associated with that principal component.
In simpler terms, PCA condenses the original features into a smaller set of variables, ensuring that the most important sources of variance are still captured.

How is a principal component calculated?

At this point a question arises:

How are the weights within each principal component defined?

This is a very good question. We said that each principal component is associated with an eigenvector, which describes the direction of the component (the direction along which the data are projected), while how "spread out" the data are along that direction depends on the eigenvalue associated with the component. We also said that the values of the eigenvector are the weights used in the linear combination of the original variables that defines the principal component. But how are these weights determined?

The weights are calculated for each principal component from the covariance matrix computed in the previous step, by solving an eigenvalue equation. Let's go step by step to clarify how the weights (i.e., the values of the eigenvectors) are determined in PCA.


1) The Weights Are in the Eigenvectors

In PCA, we want to transform our data into a new coordinate system where:

  • The first axis captures the most variance.
  • The second axis captures the second most variance.
  • And so on...

To achieve this, we find the eigenvectors of the covariance matrix \(A\), which define the new coordinate axes.
These eigenvectors provide the weights for the linear combination of the original variables.


2) Who Decides the Weights?

The weights (eigenvectors) are not manually chosen—they are determined mathematically by solving the equation:

$$ A v = \lambda v $$

where:

  • \(A\) is the covariance matrix.
  • \(v\) is the eigenvector (which defines the new coordinate direction).
  • \(\lambda\) is the eigenvalue (which tells how much variance is explained by that eigenvector).

This equation means that the eigenvectors must be chosen so that, when multiplied by \(A\), they only get scaled by their eigenvalue \(\lambda\).

So, the structure of the data itself (the covariance between variables) dictates the eigenvectors.

That’s why we say the weights are decided by the data.


3) Step-by-Step: How the Weights Are Found

a) Compute the covariance matrix \(A\)

  • This tells us how variables are related to each other.

b) Solve for eigenvectors and eigenvalues

  • The eigenvectors of \(A\) are found by solving \( A v = \lambda v \).
  • Mathematically, we solve:

    $$ \det(A - \lambda I) = 0 $$

    to find the eigenvalues \(\lambda\), and then solve for the corresponding eigenvectors \(v\).

c) Use the eigenvectors as transformation weights

  • Each eigenvector provides the coefficients (weights) for forming the principal components.

4) Example with Two Variables
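The original worked example is not reproduced here, so below is a minimal sketch with two invented variables (say, the expression of two genes measured in six cells). It computes the covariance matrix, extracts its eigenvalues and eigenvectors, verifies that \( A v = \lambda v \) holds, and uses the first eigenvector as the weights of PC1:

import numpy as np

# Two made-up, correlated variables (e.g., expression of two genes in 6 cells)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([1.5, 3.0, 7.0, 9.0,  9.5, 13.0])

# Standardize (mean 0, std 1), as PCA requires
X = np.column_stack([x, y])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# a) Covariance matrix A (2 x 2)
A = np.cov(X, rowvar=False)
print("Covariance matrix A:\n", A)

# b) Eigenvalues and eigenvectors of A
eigenvalues, eigenvectors = np.linalg.eigh(A)  # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]          # sort from largest to smallest eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns = weights of each PC):\n", eigenvectors)

# c) The first eigenvector gives the weights of PC1
v1, lam1 = eigenvectors[:, 0], eigenvalues[0]
print("A @ v1:    ", A @ v1)      # equals lam1 * v1 ...
print("lam1 * v1: ", lam1 * v1)   # ... confirming A v = lambda v

# The PC1 scores are the weighted linear combination of the original variables
pc1_scores = X @ v1
print("PC1 scores:", pc1_scores)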


All this mathematics might seem overwhelming, but one key concept that must not be overlooked is the role of weights in the linear combination of each principal component.

The weights assigned to each original variable vary across different principal components, and they indicate how much each variable contributes to explaining the variance in the dataset.

  • If \( w \gg 0 \), the original variable strongly contributes to explaining the variance of that principal component.
  • If \( w = 0 \), the variable does not contribute to that component.
  • If \( w \ll 0 \), the variable contributes strongly but in the opposite direction, meaning it is negatively correlated with that component.

It is also important to understand how to visualize the weights.

Understanding Loadings in PCA Using the Given Plot:

The image above is a PCA biplot, which is a common way to visualize loadings (weights) in Principal Component Analysis.

1) What Do the Blue Arrows Represent?

  • The blue arrows (vectors) represent the loadings, which show how each original variable contributes to the new principal component axes.
  • The direction and length of each arrow indicate how strongly each variable influences each principal component (PC1 and PC2 in this case).

2) How to Interpret Loadings in This Plot

  • Longer arrows → The variable has a strong influence on that principal component.
  • Shorter arrows → The variable has a weaker contribution to that PC.
  • Angles between arrows → The correlation between variables:
    • If two arrows point in the same direction, the variables are positively correlated.
    • If two arrows point in opposite directions, the variables are negatively correlated.
    • If two arrows are perpendicular, the variables are uncorrelated.

3) Example Interpretation from the Plot

  • The longest arrow likely represents a variable that dominates the variance in PC1.
  • If an arrow is aligned with PC1, that variable contributes more to PC1 than to PC2.
  • If an arrow is aligned with PC2, that variable contributes more to PC2.
  • If an arrow is in between PC1 and PC2, it means that the variable contributes to both components.

4) Why Is This Important?

  • Loadings help us understand what each principal component represents.
  • If we see that certain variables cluster together, we can infer underlying patterns in the dataset.
  • This visualization is useful for dimensionality reduction, feature selection, and data interpretation.
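The biplot itself is not embedded here, so the following sketch (with invented data and invented variable names Var1–Var4) illustrates how such a plot can be drawn with scikit-learn and matplotlib: samples are plotted as points using their PC scores, and the loadings are drawn as blue arrows.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up data: 50 samples, 4 correlated variables
base = rng.normal(size=(50, 2))
X = np.column_stack([
    base[:, 0] + 0.1 * rng.normal(size=50),   # Var1
    base[:, 0] + 0.2 * rng.normal(size=50),   # Var2 (correlated with Var1)
    base[:, 1] + 0.1 * rng.normal(size=50),   # Var3
    -base[:, 1] + 0.2 * rng.normal(size=50),  # Var4 (anti-correlated with Var3)
])
var_names = ["Var1", "Var2", "Var3", "Var4"]

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)
loadings = pca.components_.T  # rows = variables, columns = PCs

plt.figure(figsize=(7, 6))
plt.scatter(scores[:, 0], scores[:, 1], alpha=0.5, label="Samples")
for i, name in enumerate(var_names):
    # The factor 3 only stretches the arrows for visibility
    plt.arrow(0, 0, loadings[i, 0] * 3, loadings[i, 1] * 3,
              color="blue", head_width=0.08)
    plt.text(loadings[i, 0] * 3.2, loadings[i, 1] * 3.2, name, color="blue")
plt.axhline(0, color="grey", lw=0.5)
plt.axvline(0, color="grey", lw=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA biplot: scores (points) and loadings (blue arrows)")
plt.legend()
plt.show()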

4) Ordering and Selecting Principal Components

From the previous step, we obtain several principal components (PCs), each representing a linear combination of the original variables and explaining a certain amount of variance in the data. Since our goal is to reduce the number of variables and limit dimensionality, we discard PCs that explain very little variance, keeping only the most informative ones.

To achieve this, we rank the PCs in descending order based on their associated eigenvalues, which represent the amount of variance explained by each component.
We then select the most important components, often choosing the first k PCs that together explain a significant proportion of the total variance (typically 80-95%).

The optimal number of components k can be determined using a scree plot, which displays the eigenvalues in decreasing order and helps identify a cut-off point beyond which adding more components provides diminishing returns.
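As a small illustration (with made-up eigenvalues), selecting the smallest k that retains at least 90% of the variance could look like this:

import numpy as np

# Made-up eigenvalues, already sorted in descending order
eigenvalues = np.array([4.2, 2.1, 1.0, 0.4, 0.2, 0.1])

explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)
print("Explained variance ratio:", np.round(explained, 3))
print("Cumulative:", np.round(cumulative, 3))

# Smallest k whose cumulative explained variance reaches 90%
k = int(np.argmax(cumulative >= 0.90)) + 1
print("Selected number of components k =", k)  # k = 3 for these values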

5) Projecting Data onto the 𝑘 Principal Components

We have now reached the final step of PCA: transforming the data into the new reduced-dimensional space.

Now, instead of having one axis for each original variable, the new space has one axis for each of the k selected principal components (PCs).

  • The direction of each PC is determined by its corresponding eigenvector.
  • The variance explained by each PC is determined by its eigenvalue.

This transformation allows us to represent the data in fewer dimensions while preserving most of the original variance, making visualization, clustering, and further analysis more efficient.
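As a brief sketch (with invented, pre-centered data), the projection is simply a matrix multiplication between the data and the k selected eigenvectors, and it matches what scikit-learn's PCA.transform returns:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X_scaled = rng.normal(size=(20, 6))   # pretend: already standardized data
X_scaled -= X_scaled.mean(axis=0)     # make sure the columns are exactly centered

k = 3
pca = PCA(n_components=k).fit(X_scaled)

# Projection = data matrix times the matrix of the k eigenvectors (loadings)
X_manual = X_scaled @ pca.components_.T   # shape (20, 3)
X_sklearn = pca.transform(X_scaled)

print(np.allclose(X_manual, X_sklearn))   # True: same projection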

At this point, it is important to introduce two final concepts: explained variance and cumulative explained variance.

In PCA, each principal component (PC) explains a certain proportion of the total variance in the dataset. This is determined using the eigenvalues associated with each PC.

- Formula for Explained Variance

Each eigenvalue \(\lambda_k\) represents the amount of variance explained by the corresponding principal component \(PC_k\).

The fraction of variance explained by a given component \(k\) is:

$$ \frac{\lambda_k}{\sum_{i=1}^{P} \lambda_i} $$

- Formula for Cumulative Explained Variance:

The cumulative explained variance tells us how much of the total variance is preserved by the first \(k\) components:

$$ \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{P} \lambda_i} $$

An easy example:
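For instance, suppose (with made-up numbers) that a dataset with four variables has eigenvalues \(\lambda_1 = 4\), \(\lambda_2 = 2\), \(\lambda_3 = 1\), and \(\lambda_4 = 1\). The total variance is \(4 + 2 + 1 + 1 = 8\), so PC1 alone explains \(4/8 = 50\%\) of the variance, and the first two components together explain \((4 + 2)/8 = 75\%\). Keeping only PC1 and PC2 therefore halves the number of dimensions while preserving three quarters of the total variance.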


Some practice:

From: https://github.com/erilu/bulk-rnaseq-analysis/tree/master?tab=readme-ov-file#pca-plot

Below is a complete Python script that reads a bulk RNA‑seq raw counts file, performs basic statistical description and normalization, then runs PCA. It generates a Scree Plot (explained variance) and a loadings plot, shows how to select the optimal number of components, and evaluates the PCA performance via reconstruction error.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Load the dataset
# Assuming the CSV file has genes as rows and samples as columns, with gene IDs in the first column.
df = pd.read_csv("/home/omar/Downloads/raw_counts.csv", index_col=0)

# 2. Basic Statistical Description
print("Dataset Shape (Genes x Samples):", df.shape)
print("\nSummary Statistics:")
print(df.describe())
print("\nMissing values per gene:")
print(df.isnull().sum())

# 3. Data Normalization
# Bulk RNA-seq raw counts are typically log-transformed to stabilize variance.
# We apply a log2(count + 1) transformation (the pseudo-count of 1 avoids log(0)).
df_log = np.log2(df + 1)

# PCA expects samples as rows and features as columns.
# If genes are rows and samples are columns, we need to transpose.
df_log = df_log.transpose()
print("\nAfter Log Transformation and Transposition:")
print("Shape (Samples x Genes):", df_log.shape)
print(df_log.head())

# 4. Standardize the Data (mean = 0, std = 1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_log)

# 5. Perform PCA on the scaled data
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# 6. Explained Variance and Scree Plot
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
print("\nExplained Variance Ratio for each PC:")
print(explained_variance)
print("\nCumulative Explained Variance:")
print(cumulative_variance)

# Plot Scree Plot (individual and cumulative explained variance)
plt.figure(figsize=(8, 5))
components = np.arange(1, len(explained_variance) + 1)
plt.plot(components, explained_variance, 'o-', label='Individual Explained Variance')
plt.plot(components, cumulative_variance, 's--', label='Cumulative Explained Variance')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
plt.legend()
plt.grid(True)
plt.show()

# 7. Loadings Plot (for the first 2 principal components)
# The loadings are the eigenvectors (pca.components_) transposed.
loadings = pca.components_.T  # Each column corresponds to a PC; each row to a gene.
plt.figure(figsize=(10, 6))
plt.scatter(loadings[:, 0], loadings[:, 1], alpha=0.7)
plt.xlabel('PC1 Loadings')
plt.ylabel('PC2 Loadings')
plt.title('Loadings Plot for PC1 and PC2')
plt.axhline(0, color='grey', lw=0.5)
plt.axvline(0, color='grey', lw=0.5)
plt.grid(True)
plt.show()

# 8. PCA Scatter Plot with Sample Labels
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c='blue', edgecolor='k', alpha=0.7)
# Annotate each sample with its index or name (if available)
for i, sample in enumerate(df_log.index):
    plt.text(X_pca[i, 0], X_pca[i, 1], str(sample), fontsize=10, color='black')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Projection of Samples onto PC1 and PC2')
plt.grid(True)
plt.show()

# 9. Choosing the Number of Components

# Select the smallest number of PCs that together explain at least 90% of the variance.
optimal_components = np.argmax(cumulative_variance >= 0.90) + 1
print("\nOptimal number of components to explain at least 90% of the variance:", optimal_components)

# 10. Evaluate PCA performance via reconstruction error

# After selecting the top 𝑚 principal components (PCs), you can reconstruct the original data
# by projecting the lower-dimensional representation back into the original feature space.
# This inverse transformation helps you quantify how much information was lost by keeping only 𝑚 components.
# Data reconstruction is a direct method to evaluate how faithfully the chosen principal components capture the underlying structure of the dataset.
# A lower reconstruction error means better preservation of the original information, indicating a successful PCA.
pca_opt = PCA(n_components=optimal_components)
X_pca_opt = pca_opt.fit_transform(X_scaled)
X_reconstructed = pca_opt.inverse_transform(X_pca_opt)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print("Reconstruction Error (Mean Squared Error):", reconstruction_error)

OUTPUTS

Dataset Shape (Genes x Samples): (60676, 8)

Summary Statistics:
       PC3_Normoxia_S1  LNCAP_Hypoxia_S1  LNCAP_Hypoxia_S2  LNCAP_Normoxia_S2  \
count     6.067600e+04      60676.000000      60676.000000       6.067600e+04   
mean      2.102103e+02        617.710034        702.666211       8.002348e+02   
std       2.132506e+04       5544.522793       5970.701438       1.117891e+04   
min       0.000000e+00          0.000000          0.000000       0.000000e+00   
25%       0.000000e+00          0.000000          0.000000       0.000000e+00   
50%       0.000000e+00          0.000000          0.000000       0.000000e+00   
75%       5.000000e+00         23.000000         26.000000       3.100000e+01   
max       4.774002e+06     770450.000000     775792.000000       1.695973e+06   

       PC3_Hypoxia_S1  PC3_Hypoxia_S2  PC3_Normoxia_S2  LNCAP_Normoxia_S1  
count    6.067600e+04    6.067600e+04     6.067600e+04       6.067600e+04  
mean     6.162687e+02    2.096620e+02     6.224485e+02       6.600122e+02  
std      6.654010e+04    2.270014e+04     6.810140e+04       9.404802e+03  
min      0.000000e+00    0.000000e+00     0.000000e+00       0.000000e+00  
25%      0.000000e+00    0.000000e+00     0.000000e+00       0.000000e+00  
50%      0.000000e+00    0.000000e+00     0.000000e+00       0.000000e+00  
75%      1.200000e+01    5.000000e+00     1.300000e+01       2.500000e+01  
max      1.547065e+07    5.283176e+06     1.552876e+07       1.404144e+06  

Missing values per sample:
PC3_Normoxia_S1      0
...

Optimal number of components to explain at least 90% of the variance: 5
Reconstruction Error (Mean Squared Error): 0.05425498410826625

