Hi, everyone. How are you spending this Halloween? I'm working, but I'm not complaining: it's not a holiday that particularly appeals to me. Today I decided to talk about what I learned during my self-study of High Performance Computing. I hope these notes will help you too.

What is High Performance Computing?

High Performance Computing (HPC) is the practice of aggregating computing power in order to solve advanced computational problems.
To aggregate computing power we need a supercomputer.

Let's try to understand how HPC works.

First of all, we need to be clear about what a supercomputer is and what its architecture looks like.

A supercomputer is a computer with a much higher level of performance than a common computer, used to run computationally intensive workloads. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS).
Just for the sake of comparison, keep in mind that a desktop computer has performance in the range of hundreds of gigaFLOPS (10^11) to tens of teraFLOPS (10^13), while there are supercomputers that can perform over 10^17 FLOPS (100 petaFLOPS).
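
To get a feeling for these orders of magnitude, here is a minimal Python sketch. The FLOPS figures are the ones quoted above; the workload of 10^18 floating-point operations is a made-up number chosen only for illustration, and the sketch assumes each machine sustains its peak rate, which real applications rarely do.

```python
# Performance figures quoted in the text, in FLOPS.
DESKTOP_LOW = 1e11     # hundreds of gigaFLOPS
DESKTOP_HIGH = 1e13    # tens of teraFLOPS
SUPERCOMPUTER = 1e17   # 100 petaFLOPS

# Hypothetical workload (an assumption, for illustration only).
WORKLOAD_OPS = 1e18


def speedup(fast_flops, slow_flops):
    """How many times faster the first machine is than the second."""
    return fast_flops / slow_flops


def seconds_to_finish(operations, flops):
    """Ideal run time, assuming the machine sustains its peak rate."""
    return operations / flops


print("speedup vs a fast desktop:", speedup(SUPERCOMPUTER, DESKTOP_HIGH))
print("hours on a fast desktop:", seconds_to_finish(WORKLOAD_OPS, DESKTOP_HIGH) / 3600)
print("seconds on the supercomputer:", seconds_to_finish(WORKLOAD_OPS, SUPERCOMPUTER))
```

Under these assumptions the supercomputer is four orders of magnitude faster than even the fastest desktop in the range above.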

Supercomputers traditionally run complex simulations for research in various fields, ranging from medicine to physics, biology, chemistry, aerospace, military, climate and cybersecurity.

Supercomputers can use different computational models:

- Vector computational model, which allows the same operation, arithmetic or logical, to be performed on many data elements simultaneously.

- Parallel computing model, in which the computational effort takes place in parallel on several processors. This approach allows supercomputer systems to be built in a modular way from an incredibly large number of computing units, each of which needs only a small amount of system memory.

- Cluster computing model, which is based on a series of personal computers connected to each other. Strictly speaking, it is not a supercomputer but a network of simple computers, whose mass-market technologies can significantly reduce overall costs.

- Grid computing model, which refers to networking the computing power of various computing systems, possibly including several supercomputers, that are not necessarily physically close to each other. The best-known type of grid, the so-called "quasi-supercomputer", relies on an application that distributes the calculation across multiple systems, each of which sends its data to a central server completely independently of the others.
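
The parallel model boils down to a split, compute, combine pattern, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, the workload (a sum of squares) is arbitrary, and I use a thread pool from the standard library just to show the pattern. On a real supercomputer the workers would be separate processes on many cores and nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut the data into at most `parts` chunks of similar size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, never zero
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """The work done independently by each worker."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Split the data, let each worker process one chunk, combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


print(parallel_sum_of_squares(list(range(10))))
```

Each worker only ever sees its own chunk, which is why the same idea scales from a handful of threads to thousands of nodes.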

When we first hear about supercomputers, we imagine something gigantic and extremely complex, but we struggle to imagine its shape. That picture is more or less right: supercomputers are not small and portable like the everyday computers we are used to seeing. They have a larger and more complex structure and require a lot of space; indeed, a supercomputer can fill an entire room. Generally, how large a supercomputer gets depends largely on how much space and how much money you have.

Briefly, the general architecture of a supercomputer, from the lowest level to the highest and as shown in the image below, consists of: cores with their memory and storage, nodes, node boards, midplanes, racks and, finally, the whole system. Obviously, a network connects the components of the system and allows them to operate together.

It is important to mention that behind each of the racks there is a cooling unit. The purpose of the cooling unit is to move hot air through heat exchangers, which keeps the system at an optimal temperature.

In the supercomputer the real computing unit is made up primarily of compute nodes, cores, and memory.

At this point it is useful to understand what a compute node is. We can say that a compute node is a set of components and chips lying on a silver heatsink (as you can see in figure A). There are two main components to look at in the compute node: one is the CPU sockets, and the other is the RAM (take a look at figure B).

Figure A

Figure B

The CPU is the brain, where cores perform the needed calculations. Obviously, the more nodes we have, and the more CPUs and cores each node has, the more powerful the supercomputer will be. As an example, if we have a node with 2 CPUs with 12 cores each and 4 RAM slots with 16 GB of memory each, we have 24 cores and 64 GB of memory in a single node. Now imagine that the whole system has 500 nodes with the same cores and memory... amazing power, isn't it?
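
The arithmetic of this example is easy to check with a short Python sketch. The `Node` class and its field names are hypothetical, made up for this illustration; the numbers are the ones from the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A compute node described by its sockets and memory slots (hypothetical model)."""
    cpus: int
    cores_per_cpu: int
    ram_slots: int
    gb_per_slot: int

    @property
    def cores(self):
        return self.cpus * self.cores_per_cpu

    @property
    def memory_gb(self):
        return self.ram_slots * self.gb_per_slot


# The node from the example: 2 CPUs x 12 cores, 4 slots x 16 GB.
node = Node(cpus=2, cores_per_cpu=12, ram_slots=4, gb_per_slot=16)
node_count = 500

print("per node:", node.cores, "cores,", node.memory_gb, "GB")
print("whole system:", node.cores * node_count, "cores,",
      node.memory_gb * node_count, "GB")
```

With 500 such nodes the system adds up to 12,000 cores and 32,000 GB of memory.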

So the nodes of a supercomputer work together as one big unit to solve larger computational problems, and that's where all this processing power comes from. You can imagine that if you could use the entire system, it would be far more powerful than any laptop. But, of course, we have to limit how large a share of the system one person can use, because usually several users work on the same supercomputer.

All supercomputer components work together through an interconnect. This gives the entire system access to the memory and computing power of all the nodes. The nodes talk to each other through interconnects such as OmniPath and InfiniBand, which are the two main ones on the market now. I won't cover this in detail, but you can read more here: link1 and link2.

At this point, I hope the concept of a supercomputer is clear, but repeating it does not hurt. So, a supercomputer is a piece of hardware with exceptional computing performance, due to its ability to aggregate computing power in order to solve advanced computational problems.

As with all hardware, to use a supercomputer you need to communicate with it through software.
First of all, supercomputers today largely adopt open source operating systems, with a clear prevalence of Linux distributions (98%). The graphical management interfaces are the same as on standard Linux servers, in this case implemented on HPC systems (such as Red Hat, CentOS and so on). Many applications are written in Fortran, alongside C and C++. Fortran is able to make the most of the parallel architectures of supercomputers and is often considered simpler than C++, especially with regard to the quality of the code produced by compilers.

Downloading and installing software on a supercomputer is not terribly different from doing it on a desktop or laptop.
However, managing that software gets to be a bit more of a headache, because many more users are trying to access it. You need to make sure that you have all the versions that the users want to use and that you have the appropriate dependencies. For this reason, many HPC centers use package managers to manage their software.

Some important notes:

- What is a package manager?
A package manager or package-management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer in a consistent manner. (from Wikipedia)

- How to manage the work of multiple users in the supercomputer?
An HPC center provides infrastructure for hundreds or thousands of users, and each user needs to share a portion of the system with the others. So how do you figure out how much of the system each user gets to use? Well, we can do that with an allocation, and the onus is on the users to justify how much of the system they need.

Let's take a closer look at nodes

There are different node types in a supercomputer infrastructure. While the specific nodes may vary by center, in general, you will run across three different node types:

1) Login nodes (or head nodes), which are the nodes where you typically land when logging into a system. They are not a place for heavy computation or for running memory-intensive applications. You may not realize that an application is memory intensive: for example, running a GUI directly on the login node may bring it down and impact anyone who is logged into that node at the time. Login nodes are great for script editing and job submission.

2) Compile nodes, which are the nodes where you compile code. Typically, compile nodes have the same software stack and compilers as compute nodes, so code compiled on the compile nodes should run pretty seamlessly on the compute nodes. As a reminder, only certain languages require compiling: C, C++ and Fortran are three examples. Scripting languages like Python, R, and MATLAB do not require compiling.

3) Compute nodes, which are where the submitted jobs run and where all your processing power is located. They're accessible indirectly through the job scheduler. This is where you can run your computationally heavy jobs and really get your tasks done.

What does submitting a job mean?

We said that different users can log into the supercomputer, but what can a user actually do after getting in? Well, basically, a user can submit jobs. A job is just a task that you're asking the computer to run for you.
To be honest, there are a lot of other things a user can do besides submitting jobs, but it's probably one of the primary things they'll be doing.

There are two different types of jobs as far as HPC systems are concerned:

a) Batch job; a batch job runs in the background. You describe the job in a text file (a batch script) or pass the same information directly on the command line. This lets you get on with other work, because the submitted batch job will queue and run when resources become available. Batch jobs are submitted with the sbatch command.

b) Interactive job; as the name implies, you can work interactively at the command line of a compute node in real time. You can actually get on the compute node and do some interesting things, because you have a lot of power at your fingertips. You can generally log into a compute node only while your job is running on it, and not at any other time. Interactive jobs are started with the srun command.
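
To make the "text file containing information about the job" more concrete, here is a minimal Python sketch that renders such a batch script. The helper `render_batch_script` is hypothetical, and the job name, partition name, resource values and the program `./my_program` are placeholders I made up; the `#SBATCH` options used (`--job-name`, `--partition`, `--ntasks`, `--cpus-per-task`, `--time`) are standard Slurm options, but check your center's documentation for the ones it expects.

```python
def render_batch_script(job_name, partition, ntasks, cpus_per_task,
                        time_limit, command):
    """Return the text of a Slurm batch script as a string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# All values below are placeholders chosen for illustration.
script = render_batch_script(
    job_name="demo",
    partition="workq",
    ntasks=1,
    cpus_per_task=4,
    time_limit="01:00:00",
    command="srun ./my_program",
)
print(script)
```

Saved to a file such as job.sh, the script would then be submitted with `sbatch job.sh`.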

Each job is created inside a partition: a set of similar nodes grouped together according to some kind of logic.

If a job created on a partition requests more resources (i.e. CPUs and/or memory) than are currently available on that partition, the job will wait until the resources become available. This creates a queue of jobs, which is why partitions are also commonly referred to as queues.
As an example, in our institutional cluster we have three main partitions:

  • workq partition: the default partition. It's used exclusively for batch jobs on the 17 CPU nodes. The maximum number of concurrent jobs per user is 30.

  • interactive: it's used to create interactive jobs on the CPU nodes. Limits: 10 jobs per user; maximum walltime is 3 days (NOTE: walltime is the real elapsed time of a job, as a clock on the wall would measure it, as opposed to the CPU time it consumes).

  • cuda: allows access to the dgx01 machine. Limits: maximum 2 jobs per user; maximum 2 jobs per account (i.e. group); maximum walltime is 3 days.
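
The waiting behaviour of a partition can be sketched with a tiny first-come, first-served model in Python. This is a deliberately simplified illustration, not how a real scheduler is implemented: the `Partition` and `Job` classes are hypothetical, only CPUs are tracked, and the 8-CPU size is an arbitrary assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int


class Partition:
    """A group of nodes seen as a pool of CPUs with a waiting queue."""

    def __init__(self, name, total_cpus):
        self.name = name
        self.free_cpus = total_cpus
        self.queue = deque()   # jobs waiting for resources
        self.running = []      # jobs currently holding resources

    def submit(self, job):
        self.queue.append(job)
        self._start_waiting_jobs()

    def finish(self, job):
        self.running.remove(job)
        self.free_cpus += job.cpus
        self._start_waiting_jobs()

    def _start_waiting_jobs(self):
        # Strict first-come, first-served: stop at the first job that does not fit.
        while self.queue and self.queue[0].cpus <= self.free_cpus:
            job = self.queue.popleft()
            self.free_cpus -= job.cpus
            self.running.append(job)


partition = Partition("workq", total_cpus=8)  # size is an assumption
first, second = Job("first", cpus=6), Job("second", cpus=4)
partition.submit(first)    # fits, starts right away
partition.submit(second)   # only 2 CPUs are free, so it waits
print([j.name for j in partition.running], [j.name for j in partition.queue])
partition.finish(first)    # frees 6 CPUs, the waiting job can start
print([j.name for j in partition.running], [j.name for j in partition.queue])
```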

Sharing requires order

Let's talk a little bit about job scheduling. Jobs have to be scheduled rather than just run because you're not the only person using the system. So, jobs are put in a queue and run when enough resources are available. You need some sort of software that can distribute those jobs appropriately and manage the resources you'll need when you need them.
One of the most widely used tools for this purpose is Slurm (originally the "Simple Linux Utility for Resource Management").

Slurm keeps track of which nodes are busy or available and which jobs are queued or running, and then tells the resource manager which job to run on the available resources. In other words, Slurm manages jobs for you.

There are several Slurm commands to submit and monitor jobs:

- sbatch; sbatch submits a batch job to Slurm. There are also a lot of different flags that you can use to specify exactly what your job needs.

- squeue; this basically tells you about the jobs that are sitting in the scheduling queue. There are a lot of different flags that you can use, and one of them is the -u (or --user) flag, which you can use to show only the jobs of a specific user.

Watch and learn

Here are some YouTube tutorials that helped me better understand how to use Slurm. I've listed them in a logical order of increasing difficulty. Enjoy the videos:





At this point I would like to list some terms that you may encounter while using and studying a supercomputer.


- CPU

A Central Processing Unit (CPU), or core, or CPU core, is the smallest unit in a microprocessor that can carry out computational tasks, that is, run programs. Modern processors typically have multiple cores.

- Socket

A socket is the connector that houses the microprocessor. By extension, it represents the physical package of a processor, that typically contains multiple cores.

- Node

A node is a physical, stand-alone computer, that can handle computing tasks and run jobs. It's connected to other compute nodes via a fast network interconnect, and contains CPUs, memory and devices managed by an operating system.

- Cluster

A cluster is the complete collection of nodes with networking and file storage facilities. It's usually a group of independent computers connected via a fast network interconnect, managed by a resource manager, which acts as a large parallel computer.

- Application

An application is a computer program designed to perform a group of coordinated functions, tasks, or activities for the benefit of the user. In the context of scientific computing, an application typically performs computations related to a scientific goal (molecular dynamics simulations, genome assembly, computational fluid dynamics simulations, etc).

- Backfill

Backfill scheduling is a method that a scheduler can use in order to maximize utilization. It allows smaller (both in terms of size and time requirements), lower-priority jobs to start before larger, higher-priority ones, as long as doing so doesn't push back the higher-priority jobs' expected start time.
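
The backfill rule can be sketched in Python as follows. This is a simplified model under stated assumptions, not any scheduler's actual algorithm: the `Job` fields and the function `pick_backfill_jobs` are hypothetical, time is counted in hours, the highest-priority job is assumed not to fit right now, and its expected start time is assumed to be known.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # nodes requested
    walltime: int   # requested run time, in hours
    priority: int   # higher value means higher priority


def pick_backfill_jobs(queue, free_nodes, now, reserved_start):
    """Return the lower-priority jobs that may start now.

    reserved_start is the expected start time of the highest-priority
    job; a backfilled job must fit in the free nodes and must finish
    before that time, so the big job is not pushed back.
    """
    if not queue:
        return []
    ordered = sorted(queue, key=lambda job: job.priority, reverse=True)
    candidates = ordered[1:]  # everything except the top-priority job
    started = []
    for job in candidates:
        fits = job.nodes <= free_nodes
        ends_in_time = now + job.walltime <= reserved_start
        if fits and ends_in_time:
            started.append(job)
            free_nodes -= job.nodes
    return started


queue = [
    Job("big", nodes=10, walltime=24, priority=100),
    Job("long", nodes=2, walltime=12, priority=20),
    Job("small", nodes=2, walltime=3, priority=10),
]
# 4 nodes are idle now; "big" is expected to start in 5 hours.
backfilled = pick_backfill_jobs(queue, free_nodes=4, now=0, reserved_start=5)
print([job.name for job in backfilled])
```

Here only "small" is backfilled: "long" would also fit in the idle nodes, but it would still be running when "big" is due to start.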

- Executable program

A binary (or executable) program is the machine-code, compiled version of an application: a binary file that a computer can execute directly. This is in contrast to the application source code, which is the human-readable version of the application's internal instructions and needs to be compiled by a compiler to produce the executable binary.

- Fairshare

A resource scheduler ranks jobs by priority for execution. Each job's priority in the queue is determined by multiple factors, one of which is the user's fairshare score. A user's fairshare score is computed based on a target (the given portion of the resources that this user should be able to use) and the user's effective usage, i.e. the amount of resources they effectively used in the past. As a result, the more resources past jobs have used, the lower the priority of the next jobs will be. Past usage is computed over a sliding window and progressively forgotten over time. This enables all users of a shared resource to get a fair portion of it for their own use, by giving higher priority to users who have been underserved in the past.
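
Here is a minimal Python sketch of the two ideas in this definition: past usage that is progressively forgotten, and a score that drops as usage exceeds the target. The half-life decay and the formula 2^(-usage/target) are assumptions chosen for illustration (similar in spirit to classic fair-share schemes); the exact formula depends on the scheduler and on how your center configures it.

```python
def decayed_usage(usage_history, half_life_days):
    """Sum past usage, halving the weight of a record every half-life.

    usage_history is a list of (days_ago, cpu_hours) pairs.
    """
    return sum(
        cpu_hours * 0.5 ** (days_ago / half_life_days)
        for days_ago, cpu_hours in usage_history
    )


def fairshare_score(target_share, usage_share):
    """Score between 0 and 1: 1.0 with no usage, 0.5 when usage equals the target.

    Both arguments are fractions of the whole system (assumed model).
    """
    return 2.0 ** (-usage_share / target_share)


# Hypothetical user: 100 CPU hours today and 100 CPU hours a week ago.
history = [(0, 100), (7, 100)]
print(decayed_usage(history, half_life_days=7))

# Hypothetical target of 25% of the system, at three levels of past usage.
for usage in (0.0, 0.25, 0.5):
    print(usage, fairshare_score(target_share=0.25, usage_share=usage))
```

The week-old usage counts for only half as much as today's, and a user who consumed twice their target gets a lower score than one who stayed on target.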


- FLOPS

Floating-point Operations Per Second (FLOPS) are a measure of computing performance, and represent the number of floating-point operations that a CPU can perform each second. Modern CPUs and GPUs are capable of doing TeraFLOPS (10^12 floating-point operations per second), depending on the precision of those operations (half-precision: 16 bits, single-precision: 32 bits, double-precision: 64 bits).


- GPU

A Graphics Processing Unit (GPU) is a specialized device initially designed to generate graphical output. On modern computing architectures, GPUs are used to accelerate certain types of computation, at which they are much faster than CPUs. GPUs have their own memory, and are attached to CPUs within a node. Each compute node can host one or more GPUs.


- HPC

High Performance Computing (HPC) refers to the practice of aggregating computing power to achieve higher performance than would be possible by using a typical computer.

- Job

A job, or batch job, is the scheduler’s base unit of computing by which resources are allocated to a user for a specified amount of time. Users create job submission scripts to ask the scheduler for resources such as cores, memory, runtime, etc. The scheduler puts the requests in a queue and allocates requested resources based on jobs’ priority.

- Job steps

Job steps are sets of (possibly parallel) tasks within a job.

- Modules

Environment modules, or software modules, are a type of software management tool used in most HPC environments. Modules enable users to selectively pick the software that they want to use and add it to their environment. This makes it possible to switch between different versions or flavors of the same software, to pick compilers, libraries and software components, and to avoid conflicts between them.

- Partition

A partition is a set of compute nodes within a cluster with a common feature. For example, compute nodes with GPUs, or compute nodes belonging to the same owner, could form a partition.

- Run time

The run time, or walltime, of a job is the real elapsed time it takes to finish its execution.

- Scheduler

The goal of a job scheduler is to find the appropriate resources to run a set of computational tasks in the most efficient manner. Based on resource requirements and job descriptions, it will prioritize those jobs, allocate resources (nodes, CPUs, memory) and schedule their execution.


- SSH

Secure Shell (SSH) is a protocol to securely access remote computers. Based on the client-server model, multiple users with an SSH client can access a remote computer. Some operating systems such as Linux and macOS have a built-in SSH client and others can use one of many publicly available clients.

- Thread

A process, in the simplest terms, is an executing program. One or more threads run in the context of the process. A thread is the basic unit to which the operating system allocates processor time. A thread can execute any part of the process code, including parts currently being executed by another thread. Threads are co-located on the same node.

- Task

In the Slurm context, a task is to be understood as a process. A multi-process program is made of several tasks. A task is typically used to schedule an MPI process, which in turn can use several CPUs. By contrast, a multi-threaded program is composed of only one task, which uses several CPUs.
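
The difference between tasks and threads shows up in how many CPUs a job asks for. Here is a minimal Python sketch; the `JobShape` class is hypothetical, its two fields mirror Slurm's `--ntasks` and `--cpus-per-task` options, and the numbers are arbitrary examples.

```python
from dataclasses import dataclass


@dataclass
class JobShape:
    """Number of processes (tasks) and CPUs per process requested by a job."""
    ntasks: int
    cpus_per_task: int

    @property
    def total_cpus(self):
        return self.ntasks * self.cpus_per_task


# Multi-process (e.g. MPI) program: many tasks, one CPU each.
mpi_job = JobShape(ntasks=8, cpus_per_task=1)

# Multi-threaded program: a single task that uses several CPUs.
threaded_job = JobShape(ntasks=1, cpus_per_task=8)

# Hybrid program: several tasks, each running several threads.
hybrid_job = JobShape(ntasks=4, cpus_per_task=2)

for job in (mpi_job, threaded_job, hybrid_job):
    print(job, "->", job.total_cpus, "CPUs")
```

All three shapes use the same number of CPUs, but only the single-task one is guaranteed to have all of them on the same node, since threads of a process cannot span nodes.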

See you soon and happy Halloween to all.