

# **Exercise Session 14**

Systems Programming and Computer Architecture

Devices

Fall Semester 2024

### Disclaimer

- Website: n.ethz.ch/~falkbe/
- (Extra) Demos on GitHub: github.com/falkbe
- Kahoots: now on website n.ethz.ch/~falkbe/
- My exercise slides have additional slides (which are not official part of the course) having a blue heading
- For the exam only the official exercise slides are relevant, if in doubt always check the ones on the official moodle page

#### **Overview**

- Lecture Recap: Devices
- SPCA in a nutshell
- Exam Remarks
- SPCA in perspective
- Questions

#### **Devices and Device Drivers**

Systems Programming and Computer Architecture

- **Devices**: Device Registers/Dealing with caches
- **Device Driver**: Operating System Part of the device

- **Devices**: Device Registers/Dealing with caches
- **Device Driver**: Operating System Part of the device

### **SPCA Devices**



Figure 8.28 Support hardware for memory-mapped I/O

### **SPCA Devices**



- Input/Output (I/O) systems are used to connect a computer with external devices called peripherals (keyboards, monitors, etc.)
- Processor accesses an I/O device using the address/data busses the same way as it

accesses memory

• Part of address space is dedicated to I/O devices rather than memory: a store sends

data to the device, with a load we recieve data from the device

• => This method of communicating is called **memory mapped I/O** 

#### What is a device?

Specifically, to an OS programmer:

- Piece of hardware visible from software
- Occupies some location on a bus
- Set of registers
  - Memory mapped or I/O space
- Source of interrupts
- May initiate Direct Memory Access transfers

#### Registers

#### • CPU can load from device registers:

- Obtain status info
- Read input data
- CPU can store to device registers:
  - Set device state and configuration
  - Write output data
  - Reset states

#### Addressing registers

- 1. Memory mapped:
  - Registers appear as memory locations
  - Access using loads/stores (movb/movw/mov1/movq)
- 2. "I/O instructions":
  - Different (16 bit) address space for older I/O devices
  - Specific (these days) to x86 architecture
  - Special instructions: inb, outb, etc.

#### Registers are not memory

#### **Device registers don't behave like RAM!**

- Register contents change without writes from CPU
  - Status words
  - Incoming data
- Writes to registers are used to trigger actions
  - Sending data
  - Resetting state machines

#### Dealing with caches

- Reads can't come from the cache
  - Register value changes  $\Rightarrow$  cache becomes **inconsistent**
- Write-back caches (and write buffers) cause problems
  - You don't know when the line will be written
- Reads and writes cannot be combined into cache lines
  - Registers might require single word or byte writes only
  - Line-size writes stomp on other registers
  - Even spurious reads trigger device state changes
- $\Rightarrow$  Device register access **must** bypass the cache
  - Handled in the MMU: PTEs have "no cache" flag
  - I/O space access isn't cached anyway

- **Devices**: Device Registers/Dealing with caches
- **Device Driver**: Operating System Part of the device



#### Basic model



## Devices Very simple UART driver

```
#define UART_BASE 0x3f8
#define UART_THR (UART_BASE + 0)
#define UART RBR (UART BASE + 0)
#define UART_LSR (UART_BASE + 5)
void serial_putc(char c)
 // Wait until FIFO can hold more chars
  while( (inb(UART_LSR) & 0x20)== 0);
 // Write character to FIFO
 outb(UART_THR, c);
char serial_getc()
  // Wait until there is a char to read
 while( (inb(UART_LSR) & 0x01) == 0);
 // Read from the receive FIFO
  return inb(UART_RBR);
```

Register addresses from data sheet 0x3f8: location on a PC

Send a character (wait until we can first)

Read a character (spin waiting until one is there to read)

#### Very simple UART driver

- Actually, far too simple!
  - But this is how the first version always looks...
- No initialization code, no error handling.
- Uses Programmed I/O (PIO)
  - CPU explicitly reads and writes all values to and from registers
  - All data must pass through CPU registers
- Uses *polling* 
  - CPU polls device register waiting before send/receive
    - Tight loop!
  - Can't do anything else in the meantime
    - Although could be extended with threads and care...
  - Without CPU polling, no I/O can occur

- Each device operates differently: Different control register layouts, Unique set of commands, etc.
- Device driver hides these details from the OS: provides a

standardised API (e.g. send\_packet(), read\_block() ) for OS

• The OS (the OS's device driver) is responsible for initating DMA transfers: allocates memory, configures the DMA controller

#### Other challenges

- 1. How to avoid polling all the time?
  - How does the CPU know when the device is ready, or finished?
  - Solution: interrupts
- 2. How to avoid the CPU copying all the data?
  - Can the CPU get on with something else?
  - Solution: direct memory access (DMA)
- 3. Where do these register locations come from?
  - How can the OS find devices in the physical address space?
  - How are the physical addresses allocated?
  - Solution: discoverable buses (e.g. PCI)

## **Devices** DMA, Shared Memory, PCIe

- **DMA** (Direct Memory Access): Copies data for CPU
- Shared Memory: Buffer/Descriptor Rings
- PCIe (Peripheral Component Interconnect): Finding address space for

the devices

- **DMA** (Direct Memory Access): Copies data for CPU
- Shared Memory: Buffer/Descriptor Rings
- PCIe (Peripheral Component Interconnect): Finding address space for

the devices

#### **Direct Memory Access**

- Avoid programmed I/O for lots of data
  - E.g. fast network or disk interfaces
- Requires DMA controller
  - Generally built-in these days
- Bypasses CPU to transfer data directly between I/O device and memory
  - Doesn't take up CPU time
  - Can save memory bandwidth
  - Only one interrupt per transfer

#### Very simple DMA transfer



Systems Programming 2023 Ch. 21: Devices

### DMA and Caches

- DMA means memory becomes inconsistent with CPU caches
- Options:
  - CPU can map DMA buffers non-cacheable
     ⇒ large hit probably wants to process data anyway
  - Cache can "snoop" DMAC bus transactions (but doesn't scale beyond small SMP systems)
  - 3. OS can explicitly flush/invalidate cache regions ⇒ cache management important part of device drivers!
- Idea: add the DMA device to our cache coherency protocol

#### DMA and Virtual Memory

- DMA addresses are physical
  - Appear on external bus
- User and OS code deal with virtual addresses (mostly)
- OS (and device drivers) must manually translate virtual ↔ physical addresses when programming DMA controllers
  - This can require more than just a hardware page table!
  - DMA of a single virtual address region might not be contiguous in physical address space
  - Scatter-gather DMA controllers: DMA to/from a list of regions
- Newer systems: provide an IOMMU
  - Works like an MMU, but for DMA writes from devices
  - Must still be programmed by OS to match MMU state
  - Has all kinds of other interesting uses beyond scope of this course!

- **DMA** (Direct Memory Access): Copies data for CPU
- Shared Memory: Buffer/Descriptor Rings
- PCIe (Peripheral Component Interconnect): Finding address space for

the devices

#### Basic model



- Driver and device are both *state machines*
- Data must be transferred between them
- Events signal state transitions

#### $\mathsf{Device} \leftrightarrow \mathsf{CPU} \text{ communication}$

- 1. Writing a device register
  - CPU  $\rightarrow$  device, synchronous
- 2. Reading a device register
  - CPU  $\leftrightarrow$  device, synchronous
- 3. Device requests interrupt
  - Device  $\rightarrow$  CPU, synchronous
- 4. Shared memory
  - CPU writes to memory, DMA reads
  - DMA writes to memory, CPU reads
  - Asynchronous

Neither device nor software need to communicate simultaneously!

### Buffer (or descriptor) rings



#### Buffer (or descriptor) rings (for transmit)



#### Buffer (or descriptor) rings (for transmit)



#### Overruns and underruns (receive)

- Device has no buffers for received packets
  - $\Rightarrow$  starts discarding packets
    - Not as bad as it sounds
    - Will start copying them to memory when a buffer is free
    - Signals that it's lost some in a status register
- CPU reads all received packets  $\Rightarrow$  it must wait
  - Can spin polling, but inefficient
  - Signals device to interrupt it when a new packet has been received
  - Goes off to do something else



#### Tulip descriptors



The Tulip has 2 rings of descriptors in main memory -One for transmit, one for receive

#### Descriptor rings



### Descriptor rings – chain mode



- **DMA** (Direct Memory Access): Copies data for CPU
- Shared Memory: Buffer/Descriptor Rings
- PCIe (Peripheral Component Interconnect): Finding address space for

the devices

### PCI is...

### Peripheral Component Interconnect

- An electrical standard for connecting devices
  - As is PCMCIA, PCI-X, PCI-Express, etc.
- A standard for physical connectors
- A set of "bus protocols" for communication between devices
- A software-visible interface to I/O hardware

# PCIe has succeeded PCI, but extends the same software-visible interface

### PCI tries to solve many problems:

- Device discovery
  - Finding out which devices are in the system
- Address allocation
  - Which addresses should each device's registers appear at?
- Interrupt routing
  - Which interrupt signals from the device should map to which exception vectors?
- Intelligent DMA
  - "Bus mastering" devices no longer need a DMA controller



- Host Bridge:Connects CPU ("host") to PCI bus
  - => It controls the address

#### mapping

- PCI-PCI Bridge: Connects one PCI Bus with another one
- BFD (Bus-Device-Function):
   Unique identifier for a device on a PCI bus
- Example: BDF 00:1f.0
  - **Bus**: 00, **Device**: 1f, **Function**: 0



- Secondary: New bus created
- Subord: Highest numbered bus in secondary



- Note: The memory mapped regions from PCI for devies are for the control registers (registers != buffer/descriptor rings)
- Why a tree structure? In PCI, we have PCI buses which are broadcast
- => This yields issues such as bandwith limitation, electrical limitations (signal degredation and reflactions of noise): use multiple buses, connected via PCI-PCI bridge

### PCI devices are self-describing

- Each device has a configuration header
  - Accessed through parent bridge, initially
- Some of the fields:

| Bits | Description                                           |
|------|-------------------------------------------------------|
| 16   | Manufacturer ID (identifies Intel, 3Com, NVidia, etc. |
| 16   | Model ID (specific to manufacturer)                   |
| 24   | Class code (what kind of device is this?)             |
| 8    | Version identifier                                    |

#### • Plus:

- Allocated/required address ranges (BAR values)
- Interrupts
- Electrical information
- Etc.

Systems Programming and Computer Architecture

- 1. Programming Language C: Ch. 1-7
- 2. Assembly x86-64, Compiling, Linking and Loading: Ch. 8-15
- 3. Computer Arch, Exceptions, Virtual Memory, Devices: Ch. 16-21

- 1. Programming Language C: Ch. 1-7
- 2. Assembly x86-64, Compiling, Linking and Loading: Ch. 8-15
- 3. Computer Arch, Exceptions, Virtual Memory, Devices: Ch. 16-21

### GNU gcc Toolchain



### Control flow statements (like Java or C# or C++)

switch (Integer expression) {
 case Constant\_1: Statement; break;
 case Constant\_2: Statement; break;
 ...
 case Constant\_n: Statement; break;
 default: Statement; break;

return (Expression)

Systems Programming 2024 Ch. 2: Introduction to C

### Control flow statements (like Java or C# or C++)

for (Initial; Test; Increment) Statement

while (Boolean expression) Statement

do Statement while (Boolean expression)

### Encoding integers



• A C short is 2 bytes long:

|   | Decimal | Hex   | Binary            |
|---|---------|-------|-------------------|
| х | 15213   | 3B 6D | 00111011 01101101 |
| у | -15213  | C4 93 | 11000100 10010011 |

- Sign bit
  - For 2's complement, most significant bit = 1 indicates negative

Systems Programming 2024 Ch. 3: Representing Integers

### Loading

- When the OS loads a program, it:
  - creates an address space
  - inspects the executable file to see what's in it
  - (lazily) copies regions of the file into the right place in the address space
  - does any final linking, relocation, or other needed preparation



### Dynamic memory all

### External fragmentation

• Occurs when there is enough aggregate heap memory, but no single free block is large enough

- Memory allocator?
  - VM h/w and kernel allocate page
  - Application objects typically sma
  - Allocator manages objects withi
- Allocation
  - A memory allocator doles out m
  - "block": contiguous range of byt
    - of any size, in this context



- Depends on the pattern of future requests
  - Thus, difficult to measure

### Recall: how C code runs as a process on CPU



- 1. Programming Language C: Ch. 1-7
- 2. Assembly x86-64, Compiling, Linking and Loading: Ch. 8-15
- 3. Computer Arch, Exceptions, Virtual Memory, Devices: Ch. 16-21

### Instruction Set Architecture

- Assembly Language View
  - Processor state
    - Registers, memory, ...
  - Instructions
    - addl, movq, leal, ...
    - How instructions are encoded as bytes
- Layer of Abstraction
  - Above: how to program machine
    - Processor executes instructions in a sequence
  - Below: what needs to be built
    - Use variety of tricks to make it run fast
    - E.g., execute multiple instructions simultaneously

|     | Application<br>Program |    |
|-----|------------------------|----|
|     | Compiler               | OS |
| ISA |                        |    |
|     | CPU<br>Design          |    |
|     | Circuit<br>Design      |    |
|     | Chip<br>Layout         |    |

### Moving data

#### movx Source, Dest:

- Operand Types
  - Immediate: Constant integer data
    - Example: \$0x400, \$-533
    - Like C constant, but prefixed with `\$'
    - Encoded with 1, 2, 4, 8 bytes
  - *Register:* One of 16 integer registers
    - Example: %eax, %r14d
    - Note some (e.g. %rsp, %rbp) reserved for special use
    - Others have special uses for particular instructions
  - *Memory:* 1,2,4, or 8 consecutive bytes of memory at address given by register
    - Simplest example: (%rax)
    - Various other "address modes"

| %rax | %eax | %r8  | %r8d  |
|------|------|------|-------|
| %rbx | %ebx | %r9  | %r9d  |
| %rcx | %ecx | %r10 | %r10d |
| %rdx | %edx | %r11 | %r11d |
| %rsi | %esi | %r12 | %r12d |
| %rdi | %edi | %r13 | %r13d |
| %rsp | %esp | %r14 | %r14d |
| %rbp | %ebp | %r15 | %r15d |

#### Source code:

### Compiling into assembly

#### C code

| <pre>int sum(int x, int y)</pre> |  |
|----------------------------------|--|
| {                                |  |
| int t = x+y;                     |  |
| return t;                        |  |
| }                                |  |

Obtain with command

Produces file code.s

gcc -00 -S code.c

# () Generated x86 assembly sum:

|   | sum:    |                 |
|---|---------|-----------------|
|   | endbr64 |                 |
|   | pushq   | %rbp            |
|   | movq    | %rsp, %rbp      |
|   | movl    | %edi, -20(%rbp) |
| 1 | movl    | %esi, -24(%rbp) |
|   | movl    | -20(%rbp), %edx |
|   | movl    | -24(%rbp), %eax |
|   | addl    | %edx, %eax      |
|   | movl    | %eax, -4(%rbp)  |
|   | movl    | -4(%rbp), %eax  |
|   | рорд    | %rbp            |
|   | ret     |                 |
|   |         |                 |

### Recall: how C code runs as a process on CPU



### Object code

#### Code for sum.o

|                                                                      | <sum>:</sum>   |                   |
|----------------------------------------------------------------------|----------------|-------------------|
| • Assembler                                                          | 0: f3 0f 1e fa |                   |
| <ul> <li>Translates .s into .o</li> </ul>                            | 4: 55          |                   |
|                                                                      | 5: 48 89 e5    |                   |
| <ul> <li>Binary encoding of each instruction</li> </ul>              | 8: 89 7d ec    |                   |
| <ul> <li>Nearly-complete image of executable code</li> </ul>         | b: 89 75 e8    |                   |
| <ul> <li>Missing linkages between code in different files</li> </ul> | e: 8b 55 ec    |                   |
| 6 6                                                                  | 11: 8b 45 e8   |                   |
| • Linker                                                             | 14: 01 d0      |                   |
| <ul> <li>Resolves references between files</li> </ul>                | 16: 89 45 fc   | Total of 26 bytes |
| <ul> <li>Combines with static run-time libraries</li> </ul>          | 19: 8b 45 fc   | Each instruction  |
|                                                                      | 1c: 5d         | 1, 2, or 3 bytes  |
| <ul> <li>E.g., code for malloc, printf</li> </ul>                    | 1d: c3         | Starts at address |
|                                                                      |                |                   |

- Some libraries are dynamically linked
  - Linking occurs when program begins execution

0x401040

### Alternate disassembly

#### Within gdb debugger:

| (gdb) disassemble sum          |           |                  |
|--------------------------------|-----------|------------------|
| Dump of assembler code for fun | ction sum | n:               |
| 0x000000000000000 <+0>:        | endbre    | 54               |
| 0x000000000000004 <+4>:        | push      | %rbp             |
| 0x000000000000005 <+5>:        | mov       | %rsp,%rbp        |
| 0x000000000000008 <+8>:        | mov       | %edi,-0x14(%rbp) |
| 0x000000000000000 <+11>:       | mov       | %esi,-0x18(%rbp) |
| 0x000000000000000 <+14>:       | mov       | -0x14(%rbp),%edx |
| 0x000000000000011 <+17>:       | mov       | -0x18(%rbp),%eax |
| 0x000000000000014 <+20>:       | add       | %edx,%eax        |
| 0x000000000000016 <+22>:       | mov       | %eax,-0x4(%rbp)  |
| 0x0000000000000019 <+25>:      | mov       | -0x4(%rbp),%eax  |
| 0x00000000000001c <+28>:       | рор       | %rbp             |
| 0x00000000000001d <+29>:       | retq      |                  |
| End of assembler dump.         |           |                  |
|                                |           |                  |

### Recall: how C code runs as a process on CPU



### Static linking

• Programs are translated and linked using a *compiler driver*:

unix> gcc -O2 -g -o p main.c swap.c

unix> ./p Source files main.c swap.c Translators Translators (cpp,cc1,as) (cpp,cc1,as) Separately compiled main.o swap.o relocatable object files Linker (1d) Fully linked executable object file (contains code and data for all functions defined in main.c and swap.c

Systems Programming 2024 Ch. 12: Linking

| Disasse | mbly of section .data:                        |
|---------|-----------------------------------------------|
|         | 000000000 <bufp0>:</bufp0>                    |
| 0:      | 00 00 00 00 00 00 00 00<br>0: R_X86_64_64 buf |
| Disasse | mbly of section .bss:                         |
|         | -                                             |
| 0000000 | 000000000 <bufp1>:</bufp1>                    |
| 0:      | 00 00 00 00 00 00 00 00                       |

| push<br>mov | %rbp<br>%rsp,%rbp                    | ш | 601050 | 4 h + 1 G m           |
|-------------|--------------------------------------|---|--------|-----------------------|
| movq        | <pre>\$0x60103c,0x200b3c(%rip)</pre> | Ŧ | 601020 | <br>outp              |
| mov         | 0x200b25(%rip),%rax                  | # | 601040 | <bufp< th=""></bufp<> |
| mov         | (%rax),%eax                          |   |        |                       |
| mov         | %eax,-0x4(%rbp)                      |   |        |                       |
| mov         | 0x200b19(%rip),%rax                  | # | 601040 | <bufp< td=""></bufp<> |
| mov         | 0x200b22(%rip),%rdx                  | # | 601050 | <bufp< td=""></bufp<> |
| mov         | (%rdx),%edx                          |   |        |                       |
| mov         | %edx,(%rax)                          |   |        |                       |
| mov         | 0x200b17(%rip),%rax                  | # | 601050 | <bufp< td=""></bufp<> |
| mov         | -0x4(%rbp),%edx                      |   |        |                       |
| mov         | %edx.(%rax)                          |   |        |                       |
|             |                                      |   |        |                       |

### Linking with static libraries

### Dynamic linking at load-time



### Recall: how C code runs as a process on CPU



### Buffer overflow stack

Input: 1234567901234567901234



What about something more interesting than crashing?

### Original precisions

#### IEEE 754 Single Precision (32 bits):

| s | exp | frac |
|---|-----|------|
| 1 | 8   | 23   |

#### IEEE 754 Double Precision (64 bits):

| s | exp | frac |
|---|-----|------|
| 1 | 11  | 52   |

#### Intel Extended Precision (80 bits):

| s | exp | i frac |    |  |
|---|-----|--------|----|--|
| 1 | 15  | 1      | 63 |  |

### Optimizing compilers

- Use optimization flags, default can be no optimization (-O0)!
- Good choices for gcc: -O2, -O3, -march=xxx, -m64
- Try different flags and maybe different compilers
  - icc is often faster than gcc



### Recall: how C code runs as a process on CPU



• C Program: from source code to final executable

<mark>#</mark>include <stdio.h> #include "functions.h"

```
int main(int argc, char** argv){
    printf("hello, world\n");
    printf("square of 3: %d\n", square(3));
    return 0;
```

|         | .sectio                                          | n                   | TEXT,              | text r | egular   | nur      | e in | struct | ions       |      |
|---------|--------------------------------------------------|---------------------|--------------------|--------|----------|----------|------|--------|------------|------|
|         |                                                  |                     | macos, 15,         |        |          |          |      |        | 20110      |      |
|         | .globl                                           |                     | 110003, 10,        |        | Sur_ve   |          |      |        | functior   | mair |
|         |                                                  |                     | 4, 0x90            |        |          | <i>m</i> | π    | Degin  | 1011012101 | mari |
| _main:  | .pzaciy                                          | ] ! !               | ч, 0х70            |        | ## @ma   | ain      |      |        |            |      |
|         | cfi st                                           | artproc             |                    |        | ## @iiia | 3 1 11   |      |        |            |      |
| ## %bb. |                                                  |                     |                    |        |          |          |      |        |            |      |
|         | pushq                                            | %rhn                |                    |        |          |          |      |        |            |      |
|         |                                                  |                     | fset 16            |        |          |          |      |        |            |      |
|         | .cfi_def_cfa_offset 16<br>.cfi_offset %rbp, -16  |                     |                    |        |          |          |      |        |            |      |
|         | .cti_ottset %rbp, -io<br>movq %rsp, %rbp         |                     |                    |        |          |          |      |        |            |      |
|         |                                                  |                     | gister % <b>rb</b> | n      |          |          |      |        |            |      |
|         |                                                  |                     |                    | 4      |          |          |      |        |            |      |
|         | subq \$16,%rsp<br>movl \$0,-4(%rbp)              |                     |                    |        |          |          |      |        |            |      |
|         |                                                  | \$0, -4(<br>%edi, - |                    |        |          |          |      |        |            |      |
|         |                                                  |                     |                    |        |          |          |      |        |            |      |
|         |                                                  |                     | 16(%rbp)           |        |          |          |      |        |            |      |
|         |                                                  |                     | %rip), %rd         | 1      |          |          |      |        |            |      |
|         | movb                                             | \$0, %al<br>_printf |                    |        |          |          |      |        |            |      |
|         |                                                  |                     |                    |        |          |          |      |        |            |      |
|         |                                                  | \$3, %ed            |                    |        |          |          |      |        |            |      |
|         |                                                  | _square             |                    |        |          |          |      |        |            |      |
|         |                                                  | %eax, %             |                    |        |          |          |      |        |            |      |
|         | leaq                                             |                     | 1(%rip), %         | rdi    |          |          |      |        |            |      |
|         | movb                                             | \$0, %al            |                    |        |          |          |      |        |            |      |
|         | callq                                            | _printf             |                    |        |          |          |      |        |            |      |
|         | xorl                                             | %eax, %             | eax                |        |          |          |      |        |            |      |
|         | addq                                             | \$16, %r            | sp                 |        |          |          |      |        |            |      |
|         | popq                                             | %rbp                |                    |        |          |          |      |        |            |      |
|         | retq                                             |                     |                    |        |          |          |      |        |            |      |
|         | .cfi_en                                          | Idproc              |                    |        |          |          |      |        |            |      |
|         |                                                  |                     |                    |        | ##       | End      | fun  | ction  |            |      |
|         | <pre>.sectionTEXT,cstring,cstring_literals</pre> |                     |                    |        |          |          | rals |        |            |      |
| Lstr:   |                                                  |                     |                    |        | ## @.s   | str      |      |        |            |      |
|         | .asciz                                           | "hello,             | world <b>\n</b> "  |        |          |          |      |        |            |      |
|         |                                                  |                     |                    |        |          |          |      |        |            |      |
| Lstr.   | 1:                                               |                     |                    |        | ## @.s   | str.     | 1    |        |            |      |
|         | .asciz                                           | "square             | of 3: %d\          | n"     |          |          |      |        |            |      |
|         |                                                  |                     |                    |        |          |          |      |        |            |      |
| .subsec | tions_vi                                         | .a_symbol           | S                  |        |          |          |      |        |            |      |
| ~       |                                                  |                     |                    |        |          |          |      |        |            |      |
|         |                                                  |                     |                    |        |          |          |      |        |            |      |

|                 | + h = 150 | (         |         |
|-----------------|-----------|-----------|---------|
| student-ne      | 2         |           |         |
| 000000000000000 | 11001111  | 11111010  | 1110110 |
| 00000006:       | 000000000 | 00000001  | 0000001 |
| 0000000c:       | 00000001  | 000000000 | 0000000 |
| 00000012:       | 000000000 | 000000000 | 0000100 |
| 00000018:       | 000000000 | 00100000  | 0000000 |
| 0000001e:       | 000000000 | 00000000  | 0001100 |
| 00000024:       | 10001000  | 00000001  | 0000000 |
| 0000002a:       | 000000000 | 000000000 | 0000000 |
| 00000030:       | 000000000 | 000000000 | 0000000 |
| 00000036:       | 000000000 | 000000000 | 0000000 |
| 0000003c:       | 000000000 | 000000000 | 0000000 |
| 00000042:       | 000000000 | 000000000 | 0000000 |
| 00000048:       | 00101000  | 00000010  | 0000000 |
| 0000004e:       | 000000000 | 000000000 | 1100100 |
| 00000054:       | 000000000 | 000000000 | 0000000 |
| 0000005a:       | 000000000 | 000000000 | 0000011 |
| 00000060:       | 00000100  | 000000000 | 0000000 |
| 00000066:       | 000000000 | 000000000 | 0101111 |
| 0000006c:       | 01111000  | 01110100  | 0000000 |
| 00000072:       | 000000000 | 000000000 | 0000000 |
| 00000078:       | 01011111  | 01011111  | 0101010 |
| 0000007e:       | 000000000 | 000000000 | 0000000 |
| 00000084:       | 000000000 | 000000000 | 0000000 |
| 0000008a:       | 000000000 | 00000000  | 0000000 |
| 00000090:       | 01000110  | 00000000  | 0000000 |
| 00000096:       | 00000000  | 00000000  | 0010100 |
| 0000009c:       | 00000100  | 00000000  | 0000000 |
| 000000a2:       | 000000000 | 00000000  | 0000010 |
| 000000a8:       | 00000000  | 00000100  | 0000000 |
| 000000ae:       | 00000000  | 00000000  | 0000000 |
| 000000b4:       | 00000000  | 00000000  | 0000000 |
| 000000ba:       | 01100011  | 01110011  | 0111010 |
|                 |           |           |         |

- 1. Programming Language C: Ch. 1-7
- 2. Assembly x86-64, Compiling, Linking and Loading: Ch. 8-15
- 3. Computer Arch, Exceptions, Virtual Memory, Devices: Ch. 16-21

### Pipelined hardware









Figure 7.58 Pipelined processor with full hazard handling

#### Historical Problem: Processor-memory bottleneck





#### Other caches: AMD Warsaw (Opteron 6380)



Recall: how C code runs as a process on CPU



# Core i7 memory system\*





#### Symmetric multiprocessing (SMP)



#### Physical connections: PCI is a tree



• Devices in combination with the processor





#### Questions?

Systems Programming and Computer Architecture

#### **SPCA Exam Remarks**

Systems Programming and Computer Architecture

- Know the theory well
  - Solve old pen&paper exams
  - Solve the exercise sheets again

• You will need to know some stuff by heart (How to caluclate FP bias etc.)

#### SPCA Exam Remarks

- **C Programming**: most of the exam will be programming
- You will **not need** GDB: i.e. it will not be "bomb lab" style
- **Training**: do all the C code experts
- Some of the Labs
- 1. FP Lab (If issues with bits, then also: Bit lab)
- 2. Paging Lab
- 3. Cache, Attack and Bomb Lab: help with understanding but most likely not directly applicatble

Systems Programming and Computer Architecture

Grundlagenfächer

- **Computer Networks**: How does the internet work
- Data Modelling and Data Bases: How to store huge amount of data

Kernfächer

- **Computer Systems**: Operating- and Distributed Systems
- Compiler Design: How to build a compiler

Grundlagenfächer

- **Computer Networks**: How does the internet work
- Data Modelling and Data Bases: How to store huge amount of data

Kernfächer

- **Computer Systems**: Operating- and Distributed Systems
- Compiler Design: How to build a compiler









#### BGP announcements carry complete AS-level path information instead of distances

16

Grundlagenfächer

- **Computer Networks**: How does the internet work
- Data Modelling and Data Bases: How to store huge amount of data

Kernfächer

- **Computer Systems**: Operating- and Distributed Systems
- Compiler Design: How to build a compiler

**Data Modelling and** 

Databases: Database

systems are as complex

as operating systems

today



| dvdrental=# select title, | release_year, <sup>.</sup> | length, re | eplacement_cost from fil |
|---------------------------|----------------------------|------------|--------------------------|
| dvdrental-# where length  |                            | lacement_c | cost > 29.50             |
| dvdrental-# order by tit  | le desc;                   |            |                          |
| title                     | release_year               | length     | replacement_cost         |
| +                         |                            | +          |                          |
| West Lion                 | 2006                       | 159        | 29.99                    |
| Virgin Daisy              | 2006                       | 179        | 29.99                    |
| Uncut Suicides            | 2006                       | 172        | 29.99                    |
| Tracy Cider               | 2006                       | 142        | 29.99                    |
| Song Hedwig               | 2006                       | 165        | 29.99                    |
| Slacker Liaisons          | 2006                       | 179        | 29.99                    |
| Sassy Packer              | 2006                       | 154        | 29.99                    |
| River Outlaw              | 2006                       | 149        | 29.99                    |
| Right Cranes              | 2006                       | 153        | 29.99                    |
| Quest Mussolini           | 2006                       | 177        | 29.99                    |
| Poseidon Forever          | 2006                       | 159        | 29.99                    |
| Loathing Legally          | 2006                       | 140        | 29.99                    |
| Lawless Vision            | 2006                       | 181        | 29.99                    |
| Jingle Sagebrush          | 2006                       | 124        | 29.99                    |
| Jericho Mulan             | 2006                       | 171        | 29.99                    |
| Japanese Run              | 2006                       | 135        | 29.99                    |
| Gilmore Boiled            | 2006                       | 163        | 29.99                    |
| Floats Garden             | 2006                       | 145        | 29.99                    |
| Fantasia Park             | 2006                       | 131        | 29.99                    |
| Extraordinary Conquerer   | 2006                       | 122        | 29.99                    |
| Everyone Craft            | 2006                       | 163        | 29.99                    |
| Dirty Ace                 | 2006                       | 147        | 29.99                    |
| Clyde Theory              | 2006                       | 139        | 29.99                    |
| Clockwork Paradise        | 2006                       | 143        | 29.99                    |
| Ballroom Mockingbird      | 2006                       | 173        | 29.99                    |
| (25 rows)                 |                            |            |                          |

Grundlagenfächer

- **Computer Networks**: How does the internet work
- Data Modelling and Data Bases: How to store huge amount of data

Kernfächer

- **Computer Systems**: Operating- and Distributed Systems
- Compiler Design: How to build a compiler

#### General model of OS structure

# SPCA in perspective

#### Computer Systems (OS

Part): What is an

**Operating System: how** 

to execute processes

(Kernel, Processes,

Scheduling, IO, VM, File

Systems, Network

Stack, VMs)



#### **Computer Systems** (DS

Part): Byzantine

Agreement, Quorum

Systems, Game Theory,

Advanced Blockchain

(Bitcoin, Ethereum)





Grundlagenfächer

- **Computer Networks**: How does the internet work
- Data Modelling and Data Bases: How to store huge amount of data

Kernfächer

- **Computer Systems**: Operating- and Distributed Systems
- **Compiler Design**: How to build a compiler

**Compiler Design**: Build your own

Compiler for the OAT language

Recall: how C code runs as a process on (

#include <stdio.h>

return 0;

#include "functions.h"

int main(int argc, char\*\* argv){

printf("hello, world\n");



.section \_\_TEXT,\_\_text,regular,pure\_instructions .build\_version macos, 15, 0 sdk\_version 15, 0 .globl \_main .p2align 4, 0x90 main: ## @main .cfi\_startproc ## %bb.0: pushq %rbp .cfi\_def\_cfa\_offset 16 .cfi\_offset %rbp, -16 mova %rsp, %rbp .cfi\_def\_cfa\_register %rbp \$16, %rsp \$0, -4(%rbp) movl %edi, -8(%rbp) %rsi, -16(%rbp) mova L\_.str(%rip), %rdi lead printf("square of 3: %d\n", square(3)); \$0, %al movb \_printf callg \$3, %edi movl \_square callg %eax, %esi movl L\_.str.1(%rip), %rdi leag movb \$0, %al \_printf callg %eax, %eax xorl addq \$16, %rsp %rbp .cfi\_endproc ## -- End function \_\_TEXT, \_\_cstring, cstring\_literals \_.str: ## 0.str .asciz "hello, world\n" ## @.str.1 .asciz "square of 3: %d\n" subsections\_via\_s<u>ymbols</u> (stack pointer) Memory-mapped region for shared libraries – brk Run-time heap (created by malloc) Loaded Read/write segment from (.data, .bss) the Read-only segment executable (.init, .text, .rodata) file





Merry Christmas and all the best in the new year!

#### Overview



- Assignment 11
- Devices Recap
- Hints for Assignment 12
- Exam Strategies



### Question 1a: MSI



| n | $P_A$        | $P_B$        | Comment                            | n | State C <sub>A</sub> | Value C <sub>A</sub> | State C <sub>B</sub> | Value C <sub>B</sub> | Value M |
|---|--------------|--------------|------------------------------------|---|----------------------|----------------------|----------------------|----------------------|---------|
| 1 | mov $(X),r1$ |              | $P_A$ : r1 := (X)                  | 1 |                      |                      |                      |                      |         |
| 2 |              | mov (X), r4  | $P_B$ : r4 := (X)                  | 2 |                      |                      |                      |                      |         |
| 3 | mov $0,(X)$  |              | $P_A: (X) := 0$                    | 3 |                      |                      |                      |                      |         |
| 4 |              | mov $(X),r5$ | $P_B: \mathbf{r}5 := (\mathbf{X})$ | 4 |                      |                      |                      |                      |         |
| 5 |              | mov $20,(X)$ | $P_B: (X) := 20$                   | 5 |                      |                      |                      |                      |         |



### Question 1a: MSI



| n | $P_A$        | $P_B$        | Comment                            | n | State C <sub>A</sub> | Value C <sub>A</sub> | State C <sub>B</sub> | Value C <sub>B</sub> | Value M |
|---|--------------|--------------|------------------------------------|---|----------------------|----------------------|----------------------|----------------------|---------|
| 1 | mov $(X),r1$ |              | $P_A$ : r1 := (X)                  | 1 | S                    | 10                   | 1                    | -                    | 10      |
| 2 |              | mov (X), r4  | $P_B: \mathbf{r}4 := (\mathbf{X})$ | 2 | S                    | 10                   | S                    | 10                   | 10      |
| 3 | mov $0,(X)$  |              | $P_A: (\mathbf{X}) := 0$           | 3 | Μ                    | 0                    | I                    | _                    | 10      |
| 4 |              | mov (X), r5  | $P_B: r5 := (X)$                   | 4 | S                    | 0                    | S                    | 0                    | 0       |
| 5 |              | mov $20,(X)$ | $P_B: (X) := 20$                   | 5 | I                    | _                    | М                    | 20                   | 0       |



### Question 1b: MESI



| n | $P_A$       | $P_B$         | Comment                            | n | State C <sub>A</sub> | Value C <sub>A</sub> | State C <sub>B</sub> | Value C <sub>B</sub> | Value M |
|---|-------------|---------------|------------------------------------|---|----------------------|----------------------|----------------------|----------------------|---------|
| 1 | mov (X), r1 |               | $P_A$ : r1 := (X)                  | 1 |                      |                      |                      |                      |         |
| 2 |             | mov $(X), r4$ | $P_B: \mathbf{r}4 := (\mathbf{X})$ | 2 |                      |                      |                      |                      |         |
| 3 | mov $0,(X)$ |               | $P_A: (\mathbf{X}) := 0$           | 3 |                      |                      |                      |                      |         |
| 4 |             | mov (X), r5   | $P_B: r5 := (X)$                   | 4 |                      |                      |                      |                      |         |
| 5 |             | mov \$20,(X)  | $P_B: (X) := 20$                   |   |                      |                      |                      |                      |         |



### **Question 1b: MESI**



| n | $P_A$        | $P_B$        | Co    | Only             | n | State C <sub>A</sub> | Value C <sub>A</sub> | State C <sub>B</sub> | Value C <sub>B</sub> | Value M |
|---|--------------|--------------|-------|------------------|---|----------------------|----------------------|----------------------|----------------------|---------|
| 1 | mov $(X),r1$ |              | $P_A$ | difference       |   | E                    | 10                   | I                    | _                    | 10      |
| 2 |              | mov (X), r4  | $P_B$ | between          | 2 | S                    | 10                   | S                    | 10                   | 10      |
| 3 | mov $0,(X)$  |              | $P_A$ | MESI and         | 3 | Μ                    | 0                    | 1                    | -                    | 10      |
| 4 |              | mov (X), r5  | $P_B$ | MSI              | 4 | S                    | 0                    | S                    | 0                    | 0       |
| 5 |              | mov $20,(X)$ | $P_B$ | $(\Lambda) = 20$ | 5 | 1                    | _                    | Μ                    | 20                   | 0       |

## Question 2a



- 3 cores
- 128B direct-mapped write-back cache, 8B cache line size
- X = 0xA0C0

Q2a: Which cache line will be used by the three processors?

- 8B line size ? CO = 3 bits
- 128B direct mapped 2 16 sets/lines total 2 CI = 4 bits
- CI = bits [6:3]
- 0xA0C0 = 0b1010 0000 1100 0000



# Question 2b

- 3 cores
- 128B direct-mapped write-back cache, 8B cache line size
- X = 0xA0C0
- Two-byte loads and stores in order:

| n | $P_1$     | $P_2$       | $P_3$        |
|---|-----------|-------------|--------------|
| 1 | ld r1,[X] |             |              |
| 2 | add r1,1  |             |              |
| 3 | st r1,[X] |             |              |
| 4 |           | ld r2,[X+2] |              |
| 5 |           | and r2,0FH  |              |
| 6 |           | st r2,[X+2] |              |
| 7 |           |             | ld r3,[X+6]  |
| 8 |           |             | sub r3,r1,r5 |
| 9 |           |             | st r3,[X+6]  |



| n | State C1 | State C2 | State C3 | Remarks |
|---|----------|----------|----------|---------|
| 1 |          |          |          |         |
| 2 |          |          |          |         |
| 3 |          |          |          |         |
| 4 |          |          |          |         |
| 5 |          |          |          |         |
| 6 |          |          |          |         |
| 7 |          |          |          |         |
| 8 |          |          |          |         |
| 9 |          |          |          |         |

# Question 2b

- 3 cores
- 128B direct-mapped write-back cache, 8B cache line size
- X = 0xA0C0
- Two-byte loads and stores in order:

| n | $P_1$     | $P_2$       | $P_3$        |
|---|-----------|-------------|--------------|
| 1 | ld r1,[X] |             |              |
| 2 | add r1,1  |             |              |
| 3 | st r1,[X] |             |              |
| 4 |           | ld r2,[X+2] |              |
| 5 |           | and r2,0FH  |              |
| 6 |           | st r2,[X+2] |              |
| 7 |           |             | ld r3,[X+6]  |
| 8 |           |             | sub r3,r1,r5 |
| 9 |           |             | st r3,[X+6]  |



| n | State C1 | State C2 | State C3 | Remarks                                |
|---|----------|----------|----------|----------------------------------------|
| 1 | S        | I        | I        |                                        |
| 2 | S        | I        | I        |                                        |
| 3 | Μ        | T        | I        |                                        |
| 4 | S        | S        | I        | P1 writes back, P2 waits for writeback |
| 5 | S        | S        | I        |                                        |
| 6 | I        | Μ        | I        |                                        |
| 7 | I        | S        | S        | P2 writes back, P3 waits for writeback |
| 8 | I        | S        | S        |                                        |
| 9 | T        | T        | Μ        |                                        |





#### Devices



#### How to access devices?

- Memory mapped
  - Write & Read to devices as you have done with normal memory (mov...)
  - Use of the MMU & Page
    - Tables



### How to access devices?

- IO Ports
  - Special address space (16 bit)
  - Write to an IO port
  - special instructions <sub>0</sub>,
     inb, outb





### Memory Mapped Devices



- Device represented as [Base Address, Length]
  - Base address refers to the physical address where the device starts
    - cf: your beginning of the stack frame
    - agreed upon by OS and device driver upon device startup
  - Length refers to the total size of the memory region occupied by the device
    - cf: the size of your stack frame
    - includes devices registers and if required descriptor ring
  - Set of registers within Base, Base+Length
    - cf: your variables on the stack
    - used to set control bits etc.

### Devices are **NOT** memory



- Contents of device registers may change unexpectedly from view of CPU
  - Data received

- Writing to a register may trigger actions
  - Shutdown device/machine
  - Perform reset

### ns16550 registers (each 8 bits)



| Addr. | Name | Description                                   | Notes  |
|-------|------|-----------------------------------------------|--------|
| 0     | RBR  | Receive Buffer Register (read only)           | DLAB=0 |
| 0     | THR  | Transmit Holding Register (write only)        | DLAB=0 |
| 1     | IER  | Interrupt Enable Register                     | DLAB=0 |
| 2     | IIR  | Interrupt Identification Register (read only) |        |
| 2     | FCR  | FIFO Control Register (write only)            |        |
| 3     | LCR  | Line Control Register                         |        |
| 4     | MCR  | MODEM Control Register                        |        |
| 5     | LSR  | Line Status Register                          |        |
| 6     | MSR  | MODEM Status Register                         |        |
| 7     | SCR  | Scratch Register                              |        |
| 0     | DLL  | Divisor Latch (LSB)                           | DLAB=1 |
| 1     | DLM  | Divisor Latch (MSB)                           | DLAB=1 |

### Too simple UART driver



```
1. #define UART BASE 0x12345000
2. #define UART_THR (UART_BASE + 0)
3. #define UART_RBR (UART_BASE + 4)
4. #define UART_LSR (UART_BASE + 8)
1. void serial putc(char c)
2. {
    char *lsr = (char *) UART LSR;
3.
   char *thr = (char *) UART_THR;
4.
    // Wait until FIFO can hold more chars
1.
    while((*lsr & 0x20)== 0);
2.
3.
   *thr = c
4.
5. }
```

#### What's the problem here?

### Too simple UART driver



```
1. #define UART_BASE 0x12345000
2. #define UART_THR (UART_BASE + 0)
3. #define UART_RBR (UART_BASE + 4)
4. #define UART_LSR (UART_BASE + 8)
1. void serial_putc(char c)
2. {
3. volatile char *lsr = (char *) UART_LSR;
4. volatile char *thr = (char *) UART_THR;
1. // Wait until FIFO can hold more chars
```

```
1. // wait until FIFO can hold more chars
2. while((*lsr & 0x20)== 0);
3.
4. *thr = c
5. }
```

What's the problem here?

- Compiler does not know this is a device register!
- Loop gets optimized away!
- Add: volatile keyword

Again: this volatile is not the same as in java

### **Device Register Contents**



- Each bit may have a different meaning
- Lots of configurations possible!



### **Device Drivers**



- Writing device drivers is tedious
  - Setting a single bit wrong and the device does not work
  - Debugging is hard: Likely to set a bit wrong due to a wrong shift & mask
  - Device manuals not always available



### **Devices and Caches**



- Device registers cannot be cached due to inconsistency problem, i.e. register content changes without CPU write!
- What about cache lines? Would overwrite other register values when writing back
- Set the "no-cache" flag in the page table entry

### Interrupts

```
1. #define UART BASE 0x12345000
2. #define UART_THR (UART_BASE + 0)
3. #define UART_RBR (UART_BASE + 4)
4. #define UART_LSR (UART_BASE + 8)
1. void serial putc(char c)
2.
  {
    volatile char *lsr = (char *) UART LSR;
3.
    volatile char *thr = (char *) UART THR;
4.
    // Wait until FIFO can hold more chars
1.
    while((*lsr & 0x20)== 0); -
2.
3.
    *thr = c
4.
5. }
```



What's the problem with this code?

#### Efficiency problem: Polling...

 CPU can't do any useful work while busy waiting

Solution: Use **Interrupts**, i.e. tell device to "call this function when data ready"



### Data transfer



- Don't want to waste CPU cycles just copying data
- Use Direct Memory Access instead
- Now CPU and DMA can work in parallel
- need to make sure we maintain consistency with CPU caches!

## Direct Memory Access (DMA)





### **DMA and Caches**

- DMA is like device registers! Changes the contents w/o CPU write.
- Cannot "no-cache": Bad performance since large data
- Need to explicitly invalidate cache!





### DMA Addressing



Deal with physical addresses only!

– Any problems with that?

- Programs deal with virtual addresses: need translation!
- Physical Range may not be contiguous
  - Scatter-gather DMA controllers: DMA to/from a list of regions

• How about security?



### **DMA and Cache Problems**



- On DMA read:
  - Before: Flush (= write back all) the cache to update main memory
  - After: Invalidate (= clear all) the cache to avoid old values seen
- On DMA write:
  - Before: flush or invalidate the cache to update main memory
  - After: invalidate CPU cache

### **Buffer/Descriptor Rings**



**Actual View in Memory:** Logical View: ← Tail wraps around ← Tail  $\leftarrow$  Head 7 Head

### **Buffer/Descriptor Rings**



#### **Buffer Ring**

- ring holds metadata and actual data
- data must be fixed size
- data is contigous

#### **Descriptor Ring**

- ring holds metadata and pointer (descriptor) to actual data
- data can be variable sized
- data doesn't have to be contigous
- => can make descriptor chain

### X-Max Assignment 12





# Systems @ ETH zürich

#### An unknown MPFC (?) device appeared



[] Ignore [x] Write Driver



• You will write a device driver for a MPFC device

- Device Specs
  - Uses single descriptor ring
  - Uses DMA transfers
  - Uses interrupts for "new item" or "no buffer" signals
  - Memory Mapped Device

### Assignment TO-DO



- Initialize the device (reset) and setting up the in memory data structures
- Start the device
- Start issuing DMA requests to the device
- Activate interrupt and go to sleep
- Do not terminate! Hand back the buffers to the device once used

### Hints: Interrupts



• You will need to register a handler which is called for received interrupts

- Interrupt received:
  - Wake the sleeping thread
  - Acknowledge the interrupt



• Allocate enough memory

• Keep track of who owns the buffer

- Tell the device where to find the buffer rings!
  - (Normally you would have to give the physical address, but we stay virtual this time)



typedef struct {
 size\_t size;
 void \*buffer;
 char owned;
} mpfc\_desc;

For receive: Driver = read-pointer Device = write-pointer







- Owned by driver
- Owned by device









- Owned by driver
- Owned by device







- Overruns and underruns (receive)
  - Device has no buffers for received packets
    - Then it starts discarding packets and signals that to CPU
  - CPU reads all received packets
    - Spin poll or tell device to do an interrupt when more data ready
- Overruns and underruns (transmit)
  - Device has no more packets to send
    - It must wait and either poll memory or tell CPU to wake it up
  - CPU has no more slots to send packets
    - Must wait until more slots are free, again either polling or tell the device to send interrupt

### Semester Recap





## Systems @ ETH zurich

### Exam Strategy

- 1 point  $\approx$  1 minute
- read through all the exam first => start with what you know!
- practice C coding
  - C intuition is built through practice
  - labs, code expert, old exercises in C, AoC in C, etc ...
- Stay calm :)





### Merry Christmas and all the best in the new year!