Advanced Operating Systems MS degree in Computer Engineering University of Rome Tor Vergata

### **Trap/interrupt architectures:**

1. Hardware hints
2. Relations with software and its layering
3. Bindings to the Linux architecture

# **Single-core traditional concepts**

- Traditional single-core machines only relied on
  - ➢ Traps (synchronous events with respect to software execution)
  - ➢ Interrupts from external devices (asynchronous events)
- The classical way of handling the event has been based on running operating system code on the **unique CPU-core** in the system (single core systems) upon event acceptance
- This has been enough (in terms of consistency) even for individual concurrent (multi-thread) applications given that the state of the hardware was time-shared across threads

# Some more insights



# An example with traps (e.g. syscalls)



from this point any time-shared thread sees the correct final state as determined by trap handling

### Moving to multi-core systems



This thread does not see state B: what if the TLB on Core-1 caches the same page table entries (the same portion of state) as the TLB on Core-0?

# **Core issues**

- If the system state is distributed/replicated within the hardware architecture, we need mechanisms for allowing state changes by traps/interrupts to be propagated
- As an example, a trap on Core-0 needs to be propagated on Core-1 etc.
- In some cases this is addressed by pure firmware protocols (such as when the event **is bound to deterministic handling**)
- Otherwise we need mechanisms to propagate and handle the event at the operating system (software) level

### The IPI (Inter Processor Interrupt) support

- IPI is a third type of event (beyond traps and classical interrupts) that <u>may trigger the execution of specific</u> <u>operating system software on any CPU-core</u>
- An IPI is a <u>synchronous event at the sender</u> CPU-core and an <u>asynchronous one at the recipient</u> CPU-core
- IPIs are typically used to put in place cross CPU-core activities (e.g. request/reply protocols) allowing, e.g., a specific CPU-core to trigger a change in the state of another one
- They can also be used to trigger a change in a hardware portion only observable by the other CPU-core

# **Priorities**

- IPIs are generated via firmware support, but are finally processed at software level (it becomes an OS matter)
- Classically, at least two priority levels are admitted
  - ✓ High
  - ✓ Low

- High priority leads to immediate processing of the IPI at the recipient (a single IPI is accepted and stands out at any point in time)
- Low priority generally leads to queuing the requests and processing them sequentially

# Actual support in x86 machines

- In x86 processors, the basic firmware support for interrupts is the so called APIC (Advanced Programmable Interrupt Controller)
- This offers a local instance to any CPU-core (called LAPIC Local APIC)
- As an example, LAPIC offers a CPU-core local programmable timer (for time tracking and time-sharing purposes)
- It also offers pseudo-registers to be used for posting IPI requests in the system
- IPI requests travel along an ad-hoc APIC bus

### The architectural scheme



### The architectural scheme evolution

| Interrupt controller | Available IRQ lines                                                                               |
|----------------------|---------------------------------------------------------------------------------------------------|
| PIC Intel 8259       | IRQ0 - IRQ7                                                                                       |
| Two PIC Intel 8259   | IRQ0 - IRQ15                                                                                      |
| IO-APIC              | Max 255 physical hardware IRQs; a typical system has only around 24 hardware lines in total      |

| IRQ    | Classical assignment                                                                              |
|--------|---------------------------------------------------------------------------------------------------|
| IRQ 0  | System timer. Reserved for the system. Cannot be changed by a user.                               |
| IRQ 1  | Keyboard. Reserved for the system. Cannot be altered even if no keyboard is present or needed.    |
| IRQ 2  | Second IRQ controller (cascaded PIC)                                                              |
| IRQ 3  | COM 2 (Default), COM 4 (User)                                                                     |
| IRQ 4  | COM 1 (Default), COM 3 (User)                                                                     |
| IRQ 5  | Sound card (Sound Blaster Pro or later) or LPT2 (User)                                            |
| IRQ 6  | Floppy disk controller                                                                            |
| IRQ 7  | LPT1 (Parallel port) or sound card (8-bit Sound Blaster and compatibles)                          |
| IRQ 8  | Real time clock                                                                                   |
| IRQ 9  | ACPI SCI or ISA MPU-401                                                                           |
| IRQ 10 | Free / Open interrupt / Available                                                                 |
| IRQ 11 | Free / Open interrupt / Available                                                                 |
| IRQ 12 | PS/2 connector mouse. If no PS/2 connector mouse is used, this can be used for other peripherals  |
| IRQ 13 | Math co-processor. Cannot be changed                                                              |
| IRQ 14 | Primary IDE. If no Primary IDE this can be changed                                                |
| IRQ 15 | Secondary IDE                                                                                     |

### Nomenclature

- IRQ is the actual code associated with the interrupt request (depending on the hardware configuration)
- INT is the "interrupt line" as seen by the OS-kernel software
- In essence, INT = F(IRQ)
- The evaluation of the function F is typically hardware specific
- As will be clear in a few slides, on x86 processors INT = IRQ+32
- This means that the first 32 INT lines are reserved for something else: these are the predefined traps of the hardware architecture
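The mapping INT = IRQ+32 can be sketched in a few lines of C. This is only an illustration of the rule stated above: the macro and function names are chosen for this sketch and are not taken from any kernel source.

```c
#include <assert.h>

/* Assumption for this sketch: vectors 0..31 are reserved for predefined traps. */
#define FIRST_EXTERNAL_INT 32u   /* hypothetical name */

/* F(IRQ) on x86 as stated in the slide: INT = IRQ + 32 */
static inline unsigned int irq_to_int(unsigned int irq)
{
    /* the resulting vector must still fit into the 0..255 range */
    assert(irq <= 255u - FIRST_EXTERNAL_INT);
    return irq + FIRST_EXTERNAL_INT;
}
```

With this rule, IRQ 0 (system timer) is seen by the kernel as INT 32 and IRQ 1 (keyboard) as INT 33.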

# **I/O APIC insights**

- I/O APIC tracks how many CPUs are in the current chipset
- It can selectively direct interrupts to the different CPU-cores
- It uses the so-called local APIC-ID as the identifier of a core
- Fixed/physical operations
  - ✓ it sends interrupts from a certain device to a single, predefined core
- Logical/low priority operations
  - ✓ it can deliver interrupts from a certain device to multiple cores in a round-robin fashion
  - ✓ the destination group has at most 8 elements (based on internal hardware constraints)

### **The Linux interface for APIC**

- /proc/interrupts reports the actual accounting of interrupt deliveries to the different CPU-cores
- /proc/irq/<IRQ#>/smp\_affinity tells what the affinity of an interrupt to the CPU-cores is under the logical/low priority operating mode
- The actual setup of the I/O APIC working mode is hardcoded into the kernel boot and is generally observable via the dmesg buffer
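The smp\_affinity file exposes the affinity as a hexadecimal bit mask, where bit i stands for CPU-core i. A minimal sketch of how such a mask can be interpreted follows. It assumes a mask without comma-separated groups (so at most 64 CPU-cores), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a hexadecimal affinity mask such as "f" or "5".
 * Assumption: a single group of hex digits, i.e. at most 64 CPU-cores. */
static uint64_t parse_affinity_mask(const char *text)
{
    return (uint64_t)strtoull(text, NULL, 16);
}

/* Return 1 if the interrupt may be delivered to the given CPU-core. */
static int cpu_in_mask(uint64_t mask, unsigned int cpu)
{
    assert(cpu < 64);
    return (int)((mask >> cpu) & 1u);
}
```

For instance, the mask "5" (binary 101) admits delivery to CPU-core 0 and CPU-core 2 only.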

# Linux core data structures: the IDT

- It is a table of entries that are used to describe the entry point (the GATE) for the handling of any interrupt
- x86 machines have IDTs formed by 256 entries (the max amount of IRQ vectors we can generate with the I/O APIC architecture)
- The actual size and structure of the entries depends on the type of machine we are working on (say 32 vs 64 bit machines)
- Here is a high level view of the actual usage of the entries .....

# **Linux IDT bindings**

| Vector range        | Use                                                     |
|---------------------|---------------------------------------------------------|
| 0-19 (0x0-0x13)     | Nonmaskable interrupts and exceptions                   |
| 20-31 (0x14-0x1f)   | Intel-reserved                                          |
| 32-127 (0x20-0x7f)  | External interrupts (IRQs)                              |
| 128 (0x80)          | Programmed exception for system calls (segmented style) |
| 129-238 (0x81-0xee) | External interrupts (IRQs)                              |
| 239 (0xef)          | **Local APIC timer interrupt**                          |
| 240-250 (0xf0-0xfa) | Reserved by Linux for future use                        |
| 251-255 (0xfb-0xff) | **Inter-processor interrupts**                          |

We will be back on the entries in bold in a while

## What we already saw: idtr

- The **idtr** register (interrupt descriptor table register) keeps on each CPU-core
  - ✓ the IDT <u>virtual address (expressed as a linear address of up to 6 bytes, i.e. 48 bits)</u>
  - ✓ the size of the IDT (expressed in 2 bytes), which gives the number of entries currently present (up to 256)
- This is a packed structure that we can manipulate with the LIDT (Load IDT) and SIDT (Store IDT) x86 machine instructions
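As a sketch of the packed structure handled by LIDT/SIDT, the following C fragment models its memory image in 64-bit mode. The 2-byte field holds the table size in bytes minus one, and the base is stored in an 8-byte field even when only 48 bits of the linear address are significant. The type and function names are hypothetical, and the packed attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Memory image written by SIDT and read by LIDT in 64-bit mode. */
struct idtr_image {
    uint16_t limit;   /* size of the IDT in bytes, minus one */
    uint64_t base;    /* linear address of the first IDT entry */
} __attribute__((packed));

/* Number of gate descriptors described by a given limit value.
 * entry_size is 8 in protected mode and 16 in long mode. */
static inline unsigned int idt_entries(uint16_t limit, unsigned int entry_size)
{
    assert(entry_size == 8 || entry_size == 16);
    return ((unsigned int)limit + 1u) / entry_size;
}
```

A full 256-entry IDT therefore shows a limit of 0x7ff in protected mode and 0xfff in long mode.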

### x86 protected mode

- The elements of the IDT are 8-byte data structures, each made up of two 32-bit words
- In more detail, the data structure is of type struct desc\_struct
- It is defined in include/asm-i386/desc.h as

struct desc\_struct {
 unsigned long a,b;
};

#### **Structure of the x86 protected mode IDT entry**



| Field    | Meaning                                       |
|----------|-----------------------------------------------|
| DPL      | Descriptor Privilege Level                    |
| Offset   | Offset to procedure entry point               |
| P        | Segment Present flag                          |
| Selector | Segment Selector for destination code segment |
| D        | Size of gate: 1 = 32 bits; 0 = 16 bits        |
| Reserved | Reserved bits                                 |
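To make the field layout concrete, here is a minimal sketch that packs the fields above into the two 32-bit words of a protected-mode gate. It follows the standard x86 layout (offset low half and selector in the first word, offset high half plus P, DPL and type in the second one). The struct and function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit words of a protected-mode gate, mirroring desc_struct {a, b}. */
struct gate32 {
    uint32_t a;   /* selector (bits 16..31) | offset 0..15 */
    uint32_t b;   /* offset 16..31 | P | DPL | type | reserved bits */
};

/* type: 0xE = 32-bit interrupt gate, 0xF = 32-bit trap gate */
static struct gate32 pack_gate32(uint32_t offset, uint16_t selector,
                                 unsigned int type, unsigned int dpl)
{
    struct gate32 g;

    assert(type <= 0xF && dpl <= 3);
    g.a = ((uint32_t)selector << 16) | (offset & 0x0000FFFFu);
    g.b = (offset & 0xFFFF0000u)     /* high half of the entry point */
        | (1u << 15)                 /* P: segment present */
        | ((uint32_t)dpl << 13)      /* DPL */
        | ((uint32_t)type << 8);     /* gate type, which includes the D bit */
    return g;
}
```

For example, an interrupt gate with DPL 0 for a handler at 0xC0101234 in the segment selected by 0x10 yields a = 0x00101234 and b = 0xC0108E00.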

### **Recap on relations with the GDT**

- The segment identifier/selector allows accessing the entry of the GDT where we can find the base value for the target segment
- NOTE:
  - As we already know, there are 4 valid data/code segments, all mapped to base 0x0
  - This is done in order to make <u>Linux portable to architectures offering no segmentation support</u> (i.e. only offering paging)
  - This is one reason why
    - ✓ Protection meta-data are also kept within page table entries
    - ✓ Setting up the offset for a GATE requires a <u>displacement</u> referring to 0x0, which can be denoted to the linker by the & operator

### The long mode x86-64 case

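```
/* idt.c */
#include "x86_64.h"
#include <inttypes.h>

struct idt_t {
    uint16_t offset_0_15;    /* bits 0..15 of the handler address */
    uint16_t selector;       /* code segment selector */
    unsigned ist : 3;        /* Interrupt Stack Table index (0 = not used) */
    unsigned reserved0 : 5;
    unsigned type : 4;       /* gate type */
    unsigned zero : 1;
    unsigned dpl : 2;        /* descriptor privilege level */
    unsigned p : 1;          /* present flag */
    uint16_t offset_16_31;   /* bits 16..31 of the handler address */
    uint32_t offset_63_32;   /* bits 32..63 of the handler address */
    uint32_t reserved1;
};
```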



IDT entry, interrupt gates:

| Name     | Bit    | Full Name                  | Description |
|----------|--------|----------------------------|-------------|
| Offset   | 48..63 | Offset 16..31              | Higher part of the offset. |
| P        | 47     | Present                    | Can be set to **0** for unused interrupts or for paging. |
| DPL      | 45,46  | Descriptor Privilege Level | Gate call protection. Specifies the minimum privilege level the caller must have, so that hardware and CPU interrupts can be protected from being invoked from user space. |
| S        | 44     | Storage Segment            | = 0 for interrupt gates. |
| Type     | 40..43 | Gate Type 0..3             | Possible IDT gate types:<br>0b0101 (0x5, 5): 80386 32-bit task gate<br>0b0110 (0x6, 6): 80286 16-bit interrupt gate<br>0b0111 (0x7, 7): 80286 16-bit trap gate<br>0b1110 (0xE, 14): 80386 32-bit interrupt gate<br>0b1111 (0xF, 15): 80386 32-bit trap gate |
| 0        | 32..39 | Unused 0..7                | Have to be **0**. |
| Selector | 16..31 | Selector 0..15             | Selector of the interrupt function (to make sense, the kernel's selector). The DPL field of the selector's descriptor has to be 0. |
| Offset   | 0..15  | Offset 0..15               | Lower part of the interrupt function's offset address (also known as pointer). |


### Long mode IDT entry structure



## x86 long mode fully new concepts: IST

- The **Interrupt Stack Table (IST)** is available as an alternative to handle stack switch upon traps/interrupts
- When enabled, this mechanism unconditionally switches stacks; it is enabled on a per-interrupt-vector basis using a field in the IDT entry
- This means that some interrupt vectors can selectively use the IST mechanism
- IST provides a method for specific interrupts (such as NMI, double-fault, and machine-check) to always execute on a known good stack
- The IST mechanism provides <u>up to seven IST pointers</u> in the TSS
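The stack selection just described can be sketched as follows. The structure below only keeps the fields relevant here (the real TSS holds more), the names are hypothetical, and the non-IST path is simplified to a delivery targeting privilege level 0.

```c
#include <assert.h>
#include <stdint.h>

#define IST_SLOTS 7

/* Reduced view of the long-mode TSS, for illustration only. */
struct tss_sketch {
    uint64_t rsp0;             /* kernel stack used upon a privilege change */
    uint64_t ist[IST_SLOTS];   /* IST1..IST7 stack pointers */
};

/* Stack pointer loaded when delivering through a gate whose 3-bit IST field
 * is ist_index (0 means that the IST mechanism is not used for that vector). */
static uint64_t select_stack(const struct tss_sketch *tss, unsigned int ist_index,
                             int privilege_change, uint64_t current_rsp)
{
    assert(ist_index <= IST_SLOTS);
    if (ist_index != 0)
        return tss->ist[ist_index - 1];   /* unconditional switch */
    /* classical behavior: switch only when the privilege level changes */
    return privilege_change ? tss->rsp0 : current_rsp;
}
```

Note that a vector with a non-zero IST index gets its dedicated stack even when the event hits code already running in kernel mode, which is what makes the stack "known good" for events such as a double fault.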

### A scheme



These are typically the primary stacks (possibly of different sizes) for processing a given trap/interrupt. Software will then switch to the classical kernel-level stack of the running task if nothing prevents it (e.g. a double fault).

### Macros for setting IDT entries (x86 protected mode)

Within the arch/i386/kernel/traps.c file we can find the declaration of the following macros that can be used for setting up one entry of the IDT

set\_trap\_gate(displacement,&symbol\_name)
 set\_intr\_gate(displacement,&symbol\_name)
 set\_system\_gate(displacement,&symbol\_name)

- displacement indicates the target entry of the IDT
- &symbol\_name identifies the segment displacement (starting from 0x0) which determines the address of the software module to be invoked for handling the trap or the interrupt

#### Main differences among the modules

- The set\_trap\_gate() function initializes one IDT entry in such a way as to define the value 0 as the privilege level admitted for accessing the GATE via software
- Therefore we cannot rely on the INT assembly instruction to access it unless we are already executing in kernel mode
- The set\_intr\_gate() function looks similar; however, the handler is activated with maskable interrupts disabled
- set\_system\_gate() is similar to set\_trap\_gate(); however, it defines the value 3 as the level of privilege admitted for accessing the GATE
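The privilege check behind these differences can be sketched as a one-line predicate: a software INT is admitted only if the current privilege level (CPL) is numerically not greater than the DPL stored in the GATE. The function name is hypothetical; the check does not apply to hardware interrupts or to exceptions raised by the processor itself.

```c
#include <assert.h>

/* Return 1 if an INT instruction executed at privilege level cpl may
 * access a gate whose descriptor privilege level is gate_dpl. */
static int int_instruction_allowed(unsigned int cpl, unsigned int gate_dpl)
{
    assert(cpl <= 3 && gate_dpl <= 3);
    return cpl <= gate_dpl;
}
```

This is why a gate set with privilege level 3 (e.g. the one used for system calls) is reachable from user mode, while one set with level 0 leads to a protection fault when targeted by INT from user code.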

# Variants for x86 long mode

#### CODE SNIPPET FROM desc.h

```
/*
 * This routine sets up an interrupt gate at directory privilege level 3.
 */
static inline void set_system_intr_gate(unsigned int n, void *addr)
{
        BUG_ON((unsigned)n > 0xFF);
        __set_gate(n, GATE_INTERRUPT, addr, 0x3, 0, __KERNEL_CS);
}

static inline void set_system_trap_gate(unsigned int n, void *addr)
{
        BUG_ON((unsigned)n > 0xFF);
        __set_gate(n, GATE_TRAP, addr, 0x3, 0, __KERNEL_CS);
}

static inline void set_trap_gate(unsigned int n, void *addr)
{
        BUG_ON((unsigned)n > 0xFF);
        __set_gate(n, GATE_TRAP, addr, 0, 0, __KERNEL_CS);
}
```

#### i386/kernel-2.4 examples

- Handler managing division errors: set\_trap\_gate(0,&divide\_error)
- Handler for non-maskable interrupts: set\_intr\_gate(2,&nmi)
- Handler used for dispatching system calls: set\_system\_gate(SYSCALL\_VECTOR,&system\_call)

#### **Reserved vs available IDT entries**

- The entries from 0 to 31 are reserved for handlers that are used to manage specific (predefined) events/conditions (such as divide by 0 or page fault) or are already planned for future use ... these are mostly traps
- This is based on hardware design/requirements
- All the other entries are available for system programming purposes
- As an example, the entry at displacement 0x80 has been traditionally used for kernel level access via system calls
- We note that for some of the reserved entries, microcode tasks generate a so called error-code to be passed to the handler .....

#### **Reserved vs available IDT entries**

- If needed, the handler needs to be structured in such a way as to be aware of the production of the error-code
- Particularly, beyond exploiting the error-code value, it needs to remove it from, e.g., the stack right before returning from trap/interrupt (IRET)
- Non-reserved entries are managed by the microcode with no generation of any error-code value
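As a sketch of this distinction, the function below lists the reserved vectors for which classical x86 processors push an error-code (double fault, invalid TSS, segment not present, stack fault, general protection, page fault, alignment check). More recent processors add a few further vectors, which are not covered here. The function name is hypothetical.

```c
#include <assert.h>

/* Return 1 if the processor pushes an error-code for this vector
 * (classical x86 set only), 0 otherwise. */
static int vector_has_error_code(unsigned int vector)
{
    assert(vector <= 255);
    switch (vector) {
    case 8:    /* double fault */
    case 10:   /* invalid TSS */
    case 11:   /* segment not present */
    case 12:   /* stack fault */
    case 13:   /* general protection */
    case 14:   /* page fault */
    case 17:   /* alignment check */
        return 1;
    default:
        return 0;
    }
}
```

A handler bound to a vector for which this returns 1 must discard the error-code from the stack before IRET; for any other vector the stack holds the return frame only.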

#### **Management of trap handlers**



### Modular handler management: i386 case

- Trap/interrupt handlers are typically defined via ASM code within arch/i386/kernel/entry.S (this file also keeps the specification of the system call dispatcher, which is a trap handler)
- All the handlers associated with predetermined trap/interrupts (namely those from 0 to 31) are managed <u>via an additional</u> <u>dispatcher</u>
- Initially, each handler logs a null-value into the stack in case no error-code is generated in relation to the specific trap/interrupt
- Then it logs into the stack the address of the actual handler function (typically written in C)

### Modular handler management: i386 case

- Afterwards, an assembly module operating the dispatching is activated
- This logs the CPU context and gives control to the handler via a conventional call
- Given that the input parameters are passed via the stack, the handlers need to be compiled with <code>asmlinkage</code> directives (or the more modern <code>dotraplinkage</code>)
- ... in more modern Linux kernel flavors (<u>e.g. x86 long mode</u>), the layering is a bit more articulated, but the basic concepts are the same
- <u>One thing which is dealt with explicitly is IST and the stack frame redirection</u>

#### The actual scheme




### Examples

- No error code posted by firmware (a null value is pushed in its place)
  - ENTRY(overflow): pushl \$0; pushl \$ SYMBOL\_NAME(do\_overflow); jmp error\_code
- Error code already posted by firmware
  - ENTRY(general\_protection): pushl \$ SYMBOL\_NAME(do\_general\_protection); jmp error\_code
  - ENTRY(page\_fault): pushl \$ SYMBOL\_NAME(do\_page\_fault); jmp error\_code

#### The error\_code block (still i386 case)

- The assembler code block called error\_code is in charge of logging the CPU context into the stack
- This is done by aligning the stack content with the following data structure defined in include/asm-i386/ptrace.h

```
struct pt_regs {
   long ebx; long ecx;
   long edx; long esi;
   long edi; long ebp;
   long eax; int xds; int xes;
   long orig_eax; long eip; int xcs;
   long eflags; long esp; int xss;
};
```

- The actual handler can take as input a pt\_regs\* pointer and, if needed, an unsigned long representing the error-code

#### struct pt\_regs for x86 long mode

```
struct pt_regs {
      unsigned long r15; ... unsigned long r12;
      unsigned long bp;
      unsigned long bx;
      /* arguments: non interrupts/non tracing syscalls only save up to here */
      unsigned long r11; ... unsigned long r8;
      unsigned long ax;
      unsigned long cx;
      unsigned long dx;
      unsigned long si;
      unsigned long di;
      unsigned long orig_ax;
      /* end of arguments */
      /* cpu exception frame or undefined */
      unsigned long ip;
      unsigned long cs;
      unsigned long flags;
      unsigned long sp;
      unsigned long ss;
      /* top of stack page */
};
```

#### The page fault handler: main features

- The page fault handler is do\_page\_fault(struct pt\_regs \*regs, unsigned long error\_code) and is defined in linux/arch/x86/mm/fault.c
- It takes as input the error-code determining the type of the occurred fault, which needs to be handled
- The fault type is specified via the three least significant bits of error\_code according to the following rules
  - ➢ bit 0 == 0 means no page found, 1 means protection fault
  - ➢ bit 1 == 0 means read, 1 means write
  - ➢ bit 2 == 0 means kernel, 1 means user-mode
#### x86-64 early trap/interrupt stack layout details

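Interrupt-handler stack (offsets are relative to RSP upon handler entry):

| Stack content | Offset with error code | Offset with no error code |
|---------------|------------------------|---------------------------|
| Return SS     | +40                    | +32                       |
| Return RSP    | +32                    | +24                       |
| Return RFLAGS | +24                    | +16                       |
| Return CS     | +16                    | +8                        |
| Return RIP    | +8                     | 0 (RSP points here)       |
| Error Code    | 0 (RSP points here)    | not present               |

Coming from where?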
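The layout with the error code can be expressed as a C structure whose field offsets match the values reported for the interrupt-handler stack. This is a sketch with a hypothetical type name, meant to be overlaid on the memory pointed to by RSP upon handler entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack content seen by a long-mode handler when an error code is posted. */
struct frame_with_error_code {
    uint64_t error_code;   /* +0, pointed to by RSP */
    uint64_t rip;          /* +8  */
    uint64_t cs;           /* +16 */
    uint64_t rflags;       /* +24 */
    uint64_t rsp;          /* +32 */
    uint64_t ss;           /* +40 */
};
```

When no error code is posted, the same layout applies with the first field missing, so every offset shrinks by 8 bytes.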

# **Back to IPI**

- Immediate handling is allowed for the case in which there are no data structures that are shared across CPU-cores that need to be accessed for the handling (kind of stateless scenarios)
- An example is the system-halt (e.g. upon panic)
- Other usages of IPI are
  - ✓ Execution of the same function across all the CPU-cores (exactly like the halt)
  - ✓ Change of the state of hardware components across multiple CPU-cores in the system (e.g. the TLB state)
  - ✓ Manage/initialize per-CPU variables

#### Actual IPI usage in Linux: a few examples

CALL\_FUNCTION\_VECTOR (vector 0xfb)

Sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender. The corresponding interrupt handler is named call\_function\_interrupt(). This interrupt is usually sent by means of the smp\_call\_function() facility function.

RESCHEDULE\_VECTOR (vector 0xfc)

When a CPU receives this type of interrupt, the corresponding handler, named **reschedule\_interrupt()**, limits itself to acknowledging the interrupt.

INVALIDATE\_TLB\_VECTOR (vector 0xfd)

Sent to all CPUs but the sender, forcing them to invalidate their Translation Lookaside Buffers. The corresponding handler is named invalidate\_interrupt().

# **Actual IPI API**

```
send_IPI_all( )
   Sends an IPI to all CPUs (including the sender)
```

```
send_IPI_allbutself( )
   Sends an IPI to all CPUs except the sender
```

```
send_IPI_self( )
   Sends an IPI to the sender CPU
```

```
send_IPI_mask( )
   Sends an IPI to a group of CPUs specified by a bit mask
```
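The four variants differ only in the set of recipients. The sketch below computes the destination set for each of them over a 64-bit mask where bit i stands for CPU-core i. It is a user-space illustration with hypothetical names, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpuset_sketch_t;   /* bit i = CPU-core i (hypothetical type) */

/* Recipients of send_IPI_all(): every online CPU-core, sender included. */
static cpuset_sketch_t dest_all(cpuset_sketch_t online)
{
    return online;
}

/* Recipients of send_IPI_allbutself(): every online CPU-core but the sender. */
static cpuset_sketch_t dest_allbutself(cpuset_sketch_t online, unsigned int self)
{
    assert(self < 64);
    return online & ~(UINT64_C(1) << self);
}

/* Recipient of send_IPI_self(): the sender only. */
static cpuset_sketch_t dest_self(unsigned int self)
{
    assert(self < 64);
    return UINT64_C(1) << self;
}

/* Recipients of send_IPI_mask(): the requested CPU-cores that are online. */
static cpuset_sketch_t dest_mask(cpuset_sketch_t online, cpuset_sketch_t mask)
{
    return online & mask;
}
```

With four online CPU-cores (mask 0xf) and CPU-core 1 as the sender, the "all but self" variant targets the set 0xd.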

## **Sequentialization of IPI management**

- The sequentializing approach is used in case the IPI requires managing a data structure shared across the threads
- This is the typical case of IPIs that require <u>specific parameters for correct management</u>
- These parameters are in fact <u>passed into predetermined memory locations</u> accessible to all the CPU-cores
- The classical case is the one of smp\_call\_function(), whose function pointer and parameter are both passed into a global table
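The parameter passing via a global table can be illustrated with a single-threaded simulation, where IPI delivery is replaced by a direct call to the recipient-side handler. All names below are hypothetical, the number of CPU-cores is an assumption, and a real kernel would protect the table with a lock so that one request at a time uses it.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS 4   /* assumption for this sketch */

/* Predetermined memory location shared by all the CPU-cores. */
static struct {
    void (*func)(void *info);
    void *info;
    int started;    /* recipients that picked up the request */
    int finished;   /* recipients that completed the function */
} call_data;

/* What each recipient does upon receiving the IPI. */
static void call_function_handler(void)
{
    void (*func)(void *) = call_data.func;
    void *info = call_data.info;

    call_data.started++;
    func(info);
    call_data.finished++;
}

/* Sender side: post the request, then "send" the IPI to all but self. */
static int call_on_others(void (*func)(void *), void *info, unsigned int self)
{
    assert(func != NULL && self < NCPUS);
    call_data.func = func;
    call_data.info = info;
    call_data.started = 0;
    call_data.finished = 0;
    for (unsigned int cpu = 0; cpu < NCPUS; cpu++)
        if (cpu != self)
            call_function_handler();   /* stands for IPI delivery */
    return call_data.finished;
}

/* Example function to be run remotely: increments a shared counter. */
static void bump(void *info)
{
    ++*(int *)info;
}
```

Since a single table serves all requests, two concurrent senders would overwrite each other's parameters: this is why such requests need to be sequentialized.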

## The scheme



```
int smp_call_function(void (*_func)(void *info), void *_info, int wait)
{
        /* ... */

        /* Can deadlock when called with interrupts disabled */
        WARN_ON(irqs_disabled());       /* beware this!! */

        spin_lock_bh(&call_lock);
        atomic_set(&scf_started, 0);
        atomic_set(&scf_finished, 0);
        func = _func;                   /* parameters go into global variables */
        info = _info;

        for_each_online_cpu(i)
                os_write_file(cpu_data[i].ipi_pipe[1], "C", 1);

        while (atomic_read(&scf_started) != cpus)
                barrier();

        if (wait)
                while (atomic_read(&scf_finished) != cpus)
                        barrier();

        spin_unlock_bh(&call_lock);
        return 0;
}
```

## **IPI additional effects**

- As noted before, one IPI used by Linux is the **reschedule** one
- This may lead to preemption of the task running on the CPU-core targeted by the IPI
- This may have effects on both
  - ✓ Correctness/consistency
  - ✓ Performance

## **Consistency** aspects

- What about running a piece of code which is <u>CPU-specific</u> when preemption occurs?
- One example

struct \_the\_struct v[NR\_CPUS]; v[smp\_processor\_id()] = some\_value; /\* task is preempted here... \*/ something = v[smp\_processor\_id()];

We may be targeting different entries

## **Performance aspects**

- smp\_call\_function() typically runs with interrupts allowed ... just remember the deadlock issue!!
- But we cannot risk having some smp\_call\_function() runner context-switched off the CPU
- Otherwise the release of the smp\_call\_function() resources (e.g. the spinlock) might be delayed
- .... and we might even deadlock anyhow!!

# How to run with interrupts but no actual preemption

- We use per-CPU atomic counters
- If the counter is not zero then no preemption will take place (although we can be targeted by interrupts)
- The check is clearly done via software upon attempting to process the preemption interrupt
- Beware managing the preemption counter explicitly if required!!
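The counter-based scheme can be sketched in user space as follows. The array stands for the per-CPU counters, the CPU-core is passed explicitly instead of being the running one, and all the names are hypothetical. The number of CPU-cores is an assumption.

```c
#include <assert.h>

#define NCPUS 4   /* assumption for this sketch */

static int preempt_counter[NCPUS];   /* one counter per CPU-core */

/* Entering a non-preemptable section on the given CPU-core. */
static void preempt_disable_on(unsigned int cpu)
{
    assert(cpu < NCPUS);
    preempt_counter[cpu]++;
}

/* Leaving a non-preemptable section; sections may nest. */
static void preempt_enable_on(unsigned int cpu)
{
    assert(cpu < NCPUS && preempt_counter[cpu] > 0);
    preempt_counter[cpu]--;
}

/* Software check done when the preemption request is processed:
 * interrupts are still accepted, but the task is switched off the
 * CPU-core only if the counter is zero. */
static int may_preempt(unsigned int cpu)
{
    assert(cpu < NCPUS);
    return preempt_counter[cpu] == 0;
}
```

Using a counter rather than a flag lets non-preemptable sections nest: preemption becomes possible again only when the outermost section is left.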

## **Preemption enabling/disabling API**

- preempt\_enable(): decrement the preempt counter
- preempt\_disable(): increment the preempt counter
- preempt\_enable\_no\_resched(): decrement, but do not immediately preempt
- preempt\_check\_resched(): if needed, reschedule
- preempt\_count(): return the preempt counter
- put\_cpu() / get\_cpu(): decrease/increase the counter (enable/disable preemption); these are variants of preempt\_enable() / preempt\_disable()

#### **Preemption vs SMP function calls**

```
int smp_call_function(void (*func)(void *info), void *info,
                      int nonatomic, int wait)
{
        cpumask_t map;

        /* internal structure with preemption awareness */
        preempt_disable();
        map = cpu_online_map;
        cpu_clear(smp_processor_id(), map);
        smp_call_function_map(func, info, nonatomic, wait, map);
        preempt_enable();
        return 0;
}
```

## **Be careful**

- IPI is an extremely powerful technology
- However you need to consider scalability aspects
- This leads to the conclusion that IPI schemes involving large counts of CPU-cores should be used only when strictly needed
- The classical example is when patching the kernel online, e.g. upon loading a module