Learning operating system development using Linux kernel and Raspberry Pi

View on GitHub

4.2: Scheduler basic structures

In all previous lessons we have been working mostly with either architecture specific code or driver code, and now it is the first time we will dig deep into the core of the Linux kernel. This task isn’t simple, and it requires some preparations: before you will be able to understand the Linux scheduler source code, you need to become familiar with a few major concepts that the scheduler is based on.


This is one of the most critical structures in the whole kernel — it contains all information about a running task. We already briefly touched task_struct in lesson 2 and we even have implemented our own task_struct for the RPi OS, so I assume that by this time you should already have a basic understanding how it is used. Now I want to highlight a few important fields of this struct that are relevant to our discussion.

Scheduler class

In Linux, there is an extendable mechanism that allows each task to use its own scheduling algorithm. This mechanism uses a structure sched_class. You can think about this structure as an interface that defines all methods that a schedules class have to implement. Let’s see what kind of methods are defined in the sched_class interface. (Not all of the methods are shown, but only those that I consider the most important for us)

There are a few sched_class implementations. The most commonly used one, which is typically used for all user tasks, is called “Completely Fair Scheduler (CFS)”

Completely Fair Scheduler (CFS)

The principals behind CFS algorithm are very simple:

  1. For each task in the system, CFS measures how much CPU time has been already allocated to it (The value is stored in sum_exec_runtime field of the sched_entity structure)
  2. sum_exec_runtime is adjusted accordingly to task priority and saved in vruntime field.
  3. When CFS need to pick a new task to execute, the one with the smallest vruntime is selected.

Linux scheduler uses another important data structure that is called “runqueue” and is described by the rq struct. There is a single instance of a runqueue per CPU. When a new task needs to be selected for execution, the selection is made only from the local runqueue. But if there is a need, tasks can be balanced between different rq structures.

Runqueues are used by all scheduler classes, not only by CFS. All CFS specific information is kept in cfs_rq struct, which is embedded in the rq struct. One important field of the cfs_rq struct is called min_vruntime — this is the lowest vruntime from all tasks, assigned to a runqueue. min_vruntime is assigned to a newly forked task — this ensures that the task will be selected next, because CFS always ensures that a task with the smallest vruntime is picked. This approach also ensures that the new task will not be running for an unreasonably long time before it will be preempted.

All tasks, assigned to a particular runqueue and tracked by CFS are kept in tasks_timeline field of the cfs_rq struct. tasks_timeline represents a Red–black tree, which can be used to pick tasks ordered by their vruntime value. Red-black trees have an important property: all operations on it (search, insert, delete) can be done in O(log n) time. This means that even if we have thousands of concurrent tasks in the system all scheduler methods still executes very quickly. Another important property of a red-black tree is that for any node in the tree its right child will always have larger vruntime value than the parent, and left child’s vruntime will be always less or equal then the parent’s vruntime. This has an important implication: the leftmost node is always the one with the smallest vruntime.

struct pt_regs

I have already shown you how all general purpose registers are saved on the stack when an interrupt is taken — we explored how this process works in Linux kernel and implemented a similar one for the RPi OS. What I haven’t mentioned yet, is that it is entirely legal to manipulate those registers while processing an interrupt. It is also legal to manually prepare a set of registers and put them on the stack — this is done when a kernel thread is moved to a user mode, and we are going to implement the same functionality in the next lesson. For now, you just need to remember that pt_regs structure is used to describe saved registers and it must have its fields ordered in the same order in with register are saved in the kernel_entry macro. In this lesson, we are going to see a few examples of how this structure is used.


There are far more important structures, algorithms and concepts related to scheduling, but what we’ve just explored is a minimal set, required to understand the next two chapters. Now it’s time to see how all of this actually works.

Previous Page

4.1 Process scheduler: RPi OS Scheduler

Next Page

4.3 Process scheduler: Forking a task