3.2: Low-level exception handling in Linux
Given huge Linux kernel source code, what is a good way to find the code that is responsible for interrupt handling? I can suggest one idea. Vector table base address should be stored in the ‘vbar_el1’ register, so, if you search for
vbar_el1, you should be able to figure out where exactly the vector table is initialized. Indeed, the search gives us just a few usages, one of which belongs to already familiar to us head.S. This code is inside __primary_switched function. This function is executed after the MMU is switched on. The code looks like the following.
adr_l x8, vectors // load VBAR_EL1 with virtual msr vbar_el1, x8 // vector table address
From this code, we can infer that the vector table is called
vectors and you should be able to easily find its definition.
/* * Exception vectors. */ .pushsection ".entry.text", "ax" .align 11 ENTRY(vectors) kernel_ventry el1_sync_invalid // Synchronous EL1t kernel_ventry el1_irq_invalid // IRQ EL1t kernel_ventry el1_fiq_invalid // FIQ EL1t kernel_ventry el1_error_invalid // Error EL1t kernel_ventry el1_sync // Synchronous EL1h kernel_ventry el1_irq // IRQ EL1h kernel_ventry el1_fiq_invalid // FIQ EL1h kernel_ventry el1_error_invalid // Error EL1h kernel_ventry el0_sync // Synchronous 64-bit EL0 kernel_ventry el0_irq // IRQ 64-bit EL0 kernel_ventry el0_fiq_invalid // FIQ 64-bit EL0 kernel_ventry el0_error_invalid // Error 64-bit EL0 #ifdef CONFIG_COMPAT kernel_ventry el0_sync_compat // Synchronous 32-bit EL0 kernel_ventry el0_irq_compat // IRQ 32-bit EL0 kernel_ventry el0_fiq_invalid_compat // FIQ 32-bit EL0 kernel_ventry el0_error_invalid_compat // Error 32-bit EL0 #else kernel_ventry el0_sync_invalid // Synchronous 32-bit EL0 kernel_ventry el0_irq_invalid // IRQ 32-bit EL0 kernel_ventry el0_fiq_invalid // FIQ 32-bit EL0 kernel_ventry el0_error_invalid // Error 32-bit EL0 #endif END(vectors)
Looks familiar, isn’t it? And indeed, I’ve copied most of this code and just simplified it a little bit. kernel_ventry macro is almost the same as ventry, defined in the RPi OS. One difference, though, is that
kernel_ventry also is responsible for checking whether a kernel stack overflow has occurred. This functionality is enabled if
CONFIG_VMAP_STACK is set and it is a part of the kernel feature that is called
Virtually mapped kernel stacks. I’m not going to explain it in details here, however, if you are interested, I can recommend you to read this article.
kernel_entry macro should also be familiar to you. It is used exactly in the same way as the corresonding macro in the RPI OS. Original (Linux) version, however, is a lot more complicated. The code is listed below.
.macro kernel_entry, el, regsize = 64 .if \regsize == 32 mov w0, w0 // zero upper 32 bits of x0 .endif stp x0, x1, [sp, #16 * 0] stp x2, x3, [sp, #16 * 1] stp x4, x5, [sp, #16 * 2] stp x6, x7, [sp, #16 * 3] stp x8, x9, [sp, #16 * 4] stp x10, x11, [sp, #16 * 5] stp x12, x13, [sp, #16 * 6] stp x14, x15, [sp, #16 * 7] stp x16, x17, [sp, #16 * 8] stp x18, x19, [sp, #16 * 9] stp x20, x21, [sp, #16 * 10] stp x22, x23, [sp, #16 * 11] stp x24, x25, [sp, #16 * 12] stp x26, x27, [sp, #16 * 13] stp x28, x29, [sp, #16 * 14] .if \el == 0 mrs x21, sp_el0 ldr_this_cpu tsk, __entry_task, x20 // Ensure MDSCR_EL1.SS is clear, ldr x19, [tsk, #TSK_TI_FLAGS] // since we can unmask debug disable_step_tsk x19, x20 // exceptions when scheduling. mov x29, xzr // fp pointed to user-space .else add x21, sp, #S_FRAME_SIZE get_thread_info tsk /* Save the task's original addr_limit and set USER_DS (TASK_SIZE_64) */ ldr x20, [tsk, #TSK_TI_ADDR_LIMIT] str x20, [sp, #S_ORIG_ADDR_LIMIT] mov x20, #TASK_SIZE_64 str x20, [tsk, #TSK_TI_ADDR_LIMIT] /* No need to reset PSTATE.UAO, hardware's already set it to 0 for us */ .endif /* \el == 0 */ mrs x22, elr_el1 mrs x23, spsr_el1 stp lr, x21, [sp, #S_LR] /* * In order to be able to dump the contents of struct pt_regs at the * time the exception was taken (in case we attempt to walk the call * stack later), chain it together with the stack frames. */ .if \el == 0 stp xzr, xzr, [sp, #S_STACKFRAME] .else stp x29, x22, [sp, #S_STACKFRAME] .endif add x29, sp, #S_STACKFRAME #ifdef CONFIG_ARM64_SW_TTBR0_PAN /* * Set the TTBR0 PAN bit in SPSR. When the exception is taken from * EL0, there is no need to check the state of TTBR0_EL1 since * accesses are always enabled. * Note that the meaning of this bit differs from the ARMv8.1 PAN * feature as all TTBR0_EL1 accesses are disabled, not just those to * user mappings. */ alternative_if ARM64_HAS_PAN b 1f // skip TTBR0 PAN alternative_else_nop_endif .if \el != 0 mrs x21, ttbr0_el1 tst x21, #0xffff << 48 // Check for the reserved ASID orr x23, x23, #PSR_PAN_BIT // Set the emulated PAN in the saved SPSR b.eq 1f // TTBR0 access already disabled and x23, x23, #~PSR_PAN_BIT // Clear the emulated PAN in the saved SPSR .endif __uaccess_ttbr0_disable x21 1: #endif stp x22, x23, [sp, #S_PC] /* Not in a syscall by default (el0_svc overwrites for real syscall) */ .if \el == 0 mov w21, #NO_SYSCALL str w21, [sp, #S_SYSCALLNO] .endif /* * Set sp_el0 to current thread_info. */ .if \el == 0 msr sp_el0, tsk .endif /* * Registers that may be useful after this macro is invoked: * * x21 - aborted SP * x22 - aborted PC * x23 - aborted PSTATE */ .endm
Now we are going to explore the
kernel_entry macro in details.
.macro kernel_entry, el, regsize = 64
The macro accepts 2 parameters:
el can be either
1 depending on whether an exception was generated at EL0 or EL1.
regsize is 32 if we came from 32-bit EL0 or 64 otherwise.
.if \regsize == 32 mov w0, w0 // zero upper 32 bits of x0 .endif
In 32-bit mode, we use 32-bit general purpose registers (
w0 instead of
w0 is architecturally mapped to the lower part of
x0. The provided code snippet zeroes upper 32 bits of the
x0 register by writing
w0 to itself.
stp x0, x1, [sp, #16 * 0] stp x2, x3, [sp, #16 * 1] stp x4, x5, [sp, #16 * 2] stp x6, x7, [sp, #16 * 3] stp x8, x9, [sp, #16 * 4] stp x10, x11, [sp, #16 * 5] stp x12, x13, [sp, #16 * 6] stp x14, x15, [sp, #16 * 7] stp x16, x17, [sp, #16 * 8] stp x18, x19, [sp, #16 * 9] stp x20, x21, [sp, #16 * 10] stp x22, x23, [sp, #16 * 11] stp x24, x25, [sp, #16 * 12] stp x26, x27, [sp, #16 * 13] stp x28, x29, [sp, #16 * 14]
This part saves all general purpose registers on the stack. Note, that stack pointer was already adjusted in the kernel_ventry to fit everything that needs to be stored. The order in which we save registers matters because in Linux there is a special structure pt_regs that is used to access saved registers later inside an exception handler. As you might see this structure contains not only general purpose registers but also some other information, which is mostly populated later in the
kernel_entry macro. I recommend you to remember
pt_regs struct because we are going to implement and use a similar one in the next few lessons.
.if \el == 0 mrs x21, sp_el0
x21 now contains aborted stack pointer. Note, that a task in Linux uses 2 different stacks for user and kernel mode. In case of user mode, we can use
sp_el0 register to figure out the stack pointer value at the moment when the exception was generated. This line is very important because we need to swap stack pointers during the context switch. We will talk about it in details in the next lesson.
ldr_this_cpu tsk, __entry_task, x20 // Ensure MDSCR_EL1.SS is clear, ldr x19, [tsk, #TSK_TI_FLAGS] // since we can unmask debug disable_step_tsk x19, x20 // exceptions when scheduling.
MDSCR_EL1.SS bit is responsible for enabling “Software Step exceptions”. If this bit is set and debug exceptions are unmasked, an exception is generated after any instruction has been executed. This is commonly used by debuggers. When taking exception from user mode, we need to check first whether TIF_SINGLESTEP flag is set for the current task. If yes, this indicates that the task is executing under a debugger and we must unset
The important thing to understand in this code is how information about the current task is obtained. In Linux, each process or thread (later I will reference any of them as just “task”) has a task_struct associated with it. This struct contains all metadata information about a task. On
task_struct embeds another structure that is called thread_info so that a pointer to
task_struct can always be used as a pointer to
thread_info is the place were flags are stored along with some other low-level values that
entry.S need direct access to.
mov x29, xzr // fp pointed to user-space
x29 is a general purpose register it usually has a special meaning. It is used as a “Frame pointer”. Now I want to spend some time to explain its purpose.
When a function is compiled, the first couple of instructions are usually responsible for storing old frame pointer and link register values on the stack. (Just a quick reminder:
x30 is called link register and it holds a “return address” that is used by the
ret instruction) Then a new stack frame is allocated, so that it can contain all local variables of the function, and frame pointer register is set to point to the bottom of the frame. Whenever the function needs to access some local variable it simply adds hardcoded offset to the frame pointer. Imagine now that an error has occurred and we need to generate a stack trace. We can use current frame pointer to find all local variables in the stack, and the link register can be used used to figure out the precise location of the caller. Next, we take advantage of the fact that old frame pointer and link register values are always saved at the beginning of the stack frame, and we just read them from there. After we get caller’s frame pointer we can now access all its local variables as well. This process is repeated recursively until we reach the top of the stack and is called “stack unwinding”. A similar algorithm is used by ptrace system call.
Now, going back to the
kernel_entry macro, it should be clear why do we need to clear
x29 register after taking an exception from EL0. That is because in Linux each task uses a different stack for user and kernel mode, and therefore it doesn’t make sense to have common stack traces.
.else add x21, sp, #S_FRAME_SIZE
Now we are inside else clause, which mean that this code is relevant only if we are handling an exception taken from EL1. In this case, we are reusing old stack and the provided code snippet just saves original
sp value in the
x21 register for later usage.
/* Save the task's original addr_limit and set USER_DS (TASK_SIZE_64) */ ldr x20, [tsk, #TSK_TI_ADDR_LIMIT] str x20, [sp, #S_ORIG_ADDR_LIMIT] mov x20, #TASK_SIZE_64 str x20, [tsk, #TSK_TI_ADDR_LIMIT]
Task address limit specifies the largest virtual address that can be used. When user process operates in 32-bit mode this limit is
2^32. For 64 bit kernel it can be larger and usually is
2^48. If it happens that an exception is taken from 32-bit EL1, task address limit need to be changed to TASK_SIZE_64. Also, it is required to save the original address limit because it needs to be restored before the execution will be returned to user mode.
mrs x22, elr_el1 mrs x23, spsr_el1
spsr_el1 must be saved on the stack before we start handling an exception. We haven’t done it yet in the RPI OS, because for now we always return to the same location from which an exception was taken. But what if we need to do a context switch while handling an exception? We will discuss this scenario in details in the next lesson.
stp lr, x21, [sp, #S_LR]
Link register and frame pointer registers are saved on the stack. We already saw that frame pointer is calculated differently depending on whether an exception was taken from EL0 or EL1 and the result of this calculation was already stored in
/* * In order to be able to dump the contents of struct pt_regs at the * time the exception was taken (in case we attempt to walk the call * stack later), chain it together with the stack frames. */ .if \el == 0 stp xzr, xzr, [sp, #S_STACKFRAME] .else stp x29, x22, [sp, #S_STACKFRAME] .endif add x29, sp, #S_STACKFRAME
Here stackframe property of the
pt_regs struct is filled. This property also contains link register and frame pointer, though this time the value of
elr_el1 (which is now in
x22) is used instead of
stackframe is used solely for stack unwinding.
#ifdef CONFIG_ARM64_SW_TTBR0_PAN alternative_if ARM64_HAS_PAN b 1f // skip TTBR0 PAN alternative_else_nop_endif .if \el != 0 mrs x21, ttbr0_el1 tst x21, #0xffff << 48 // Check for the reserved ASID orr x23, x23, #PSR_PAN_BIT // Set the emulated PAN in the saved SPSR b.eq 1f // TTBR0 access already disabled and x23, x23, #~PSR_PAN_BIT // Clear the emulated PAN in the saved SPSR .endif __uaccess_ttbr0_disable x21 1: #endif
CONFIG_ARM64_SW_TTBR0_PAN parameter prevents the kernel from accessing user-space memory directly. If you are wondering when this might be useful you can read this article. For now, I will also skip the detailed explanation of how this works, because such security features are too out of scope for our discussion.
stp x22, x23, [sp, #S_PC]
spsr_el1 are saved on the stack.
/* Not in a syscall by default (el0_svc overwrites for real syscall) */ .if \el == 0 mov w21, #NO_SYSCALL str w21, [sp, #S_SYSCALLNO] .endif
pt_regs struct has a field indicating whether the current exception is a system call or not. By default, we assume that it isn’t. Wait till lecture 5 for the detailed explanation how syscalls work.
/* * Set sp_el0 to current thread_info. */ .if \el == 0 msr sp_el0, tsk .endif
When a task is executed in kernel mode,
sp_el0 is not needed. Its value was previously saved on the stack so it can be easily restored in
kernel_exit macro. Starting from this point
sp_el0 will be used to hold a pointer to current task_struct for quick access.
Next thing we are going to explore is the handler that is responsible for processing IRQs taken from EL1. From the vector table we can easily find out that the handler is called
el1_irq and is defined here. Let’s take a look on the code now and examine it line by line.
el1_irq: kernel_entry 1 enable_dbg #ifdef CONFIG_TRACE_IRQFLAGS bl trace_hardirqs_off #endif irq_handler #ifdef CONFIG_PREEMPT ldr w24, [tsk, #TSK_TI_PREEMPT] // get preempt count cbnz w24, 1f // preempt count != 0 ldr x0, [tsk, #TSK_TI_FLAGS] // get flags tbz x0, #TIF_NEED_RESCHED, 1f // needs rescheduling? bl el1_preempt 1: #endif #ifdef CONFIG_TRACE_IRQFLAGS bl trace_hardirqs_on #endif kernel_exit 1 ENDPROC(el1_irq)
The following is done inside this function.
kernel_exitmacros are called to save and restore processor state. The first parameter indicates that the exception is taken from EL1.
- Debug interrupts are unmasked by calling
enable_dbgmacro. At this point, it is safe to do so, because the processor state is already saved and, even if debug exception occurred in the middle of the interrupt handler, it will be processed correctly. If you wonder why is it necessary to unmask debug exceptions during an interrupt processing in the first place - read this commit message.
- Code inside
#ifdef CONFIG_TRACE_IRQFLAGSblock is responsible for tracing interrupts. It records 2 events: interrupt start and end.
- Code inside
#ifdef CONFIG_PREEMPTblock access current task flags to check whether we need to call the scheduler. This code will be examined details in the next lesson.
irq_handler- this is the place were actual interrupt handling is performed.
irq_handler is a macro and it is defined as the follows.
.macro irq_handler ldr_l x1, handle_arch_irq mov x0, sp irq_stack_entry blr x1 irq_stack_exit .endm
As you might see from the code,
irq_handler executes handle_arch_irq function. This function is executed with special stack, that is called “irq stack”. Why is it necessary to switch to a different stack? In RPI OS, for example, we didn’t do this. Well, I guess it is not necessary, but without it, an interrupt will be handled using task stack, and we can never be sure how much of it is still left for the interrupt handler.
Next, we need to look at handle_arch_irq. It appears that it is not a function, but a variable. It is set inside set_handle_irq function. But who sets it, and what is the fade of an interrupt after it reaches this point? We will figure out the answer in the next chapter of this lesson.
As a conclusion, I can say that we’ve already explored the low-level interrupt handling code and trace the path of an interrupt from the vector table all the way to the
handle_arch_irq. This is the point were an interrupt leaves architecture specific code and started to be handled by a driver code. Our goal in the next chapter will be to trace the path of a timer interrupt through the driver source code.