



UC Berkeley Teaching Professor Dan Garcia

### Great Ideas in Computer Architecture (a.k.a. Machine Structures)


### Thread-Level Parallelism I



cs61c.org



### **UC Berkeley** Professor Bora Nikolić





# Parallel Computer Architectures







# Improving Performance

1. Increase clock rate $f_s$
   - Reached practical maximum for today's technology
   - Roughly < 5 GHz for general-purpose computers
2. Lower CPI (cycles per instruction)
   - SIMD, "instruction level parallelism"
3. Perform multiple tasks simultaneously
   - Multiple CPUs, each executing a different program
   - Tasks may be related
     - E.g., each CPU performs part of a big matrix multiplication
   - or unrelated
     - E.g., distribute different web HTTP requests over different computers
     - E.g., run pptx (view lecture slides) and browser (YouTube) simultaneously
4. Do all of the above:
   - High $f_s$, SIMD, multiple parallel tasks
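Options 1 and 2 can be read off the standard CPU performance equation. The symbols $T_{\text{exec}}$ (execution time of one program) and $N_{\text{instr}}$ (instructions executed) are names chosen here for illustration; they do not appear on the slide.

$$
T_{\text{exec}} = \frac{N_{\text{instr}} \times \text{CPI}}{f_s}
$$

Raising $f_s$ or lowering CPI shrinks $T_{\text{exec}}$ for a single instruction stream. Option 3 leaves each stream's $T_{\text{exec}}$ alone and instead runs several streams at the same time.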



Thread-Level Parallelism I (3)



### Today's lecture





### **New-School Machine Structures**

Software

- Parallel Requests: assigned to computer, e.g., Search "Cats"
- Parallel Threads: assigned to core, e.g., Lookup, Ads
- Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
- Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words
- Hardware descriptions: all gates work in parallel at same time








### Parallel Computer Architectures



Several separate computers, some means for communication (e.g., Ethernet)

GPU "graphics processing unit"

Multi-core CPU: multiple datapaths (one per core) on a single chip, sharing L3 cache, memory, and peripherals. <u>Example</u>: Hive machines





Massive array of computers, fast communication between processors









### Example: CPU with Two Cores







### **Multiprocessor Execution Model**

- Each processor (core) executes its own instructions
- Separate resources (not shared)
  - Datapath (PC, registers, ALU)
  - Highest level caches (e.g., 1<sup>st</sup> and 2<sup>nd</sup>)
- *Shared* resources
  - Memory (DRAM)
  - Often 3<sup>rd</sup> level cache
    - Often on same silicon chip
    - But not a requirement
- Nomenclature
  - "Multiprocessor Microprocessor"
  - Multicore processor
    - E.g., four core CPU (central processing unit)
    - Executes four different instruction streams simultaneously









# Multicore





### Transition to Multicore





[Figure: plot of transistors (thousands), sequential app performance, frequency (MHz), and typical power (watts)]

Garcia, Nikolić (CC BY-NC-SA)



### Transition to Multicore





[Figure: plot of transistors (thousands), parallel app performance, sequential app performance, frequency (MHz), typical power (watts), and number of cores]





### Apple A14 Chip (in their latest phones)












### **Multiprocessor Execution Model**

### Shared memory

- Each "core" has access to the entire memory in the processor
- Special hardware keeps caches consistent (next lecture!)
- Advantages:
  - Simplifies communication in program via shared variables
- Drawbacks:
  - Does not scale well:
    - "Slow" memory shared by many "customers" (cores)
    - May become bottleneck (Amdahl's Law)
- Two ways to use a multiprocessor:
  - Job-level parallelism
    - Processors work on unrelated problems
    - No communication between programs
  - Partition work of single task between several cores
    - E.g., each performs part of large matrix multiplication
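The second use, partitioning one task across cores through shared memory, can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: `matmul_rows` and `parallel_matmul` are hypothetical helper names, and the row-block split is just one possible partitioning. All threads read the shared inputs `a` and `b` and write disjoint rows of the shared result `c`, so no further coordination is needed.

```python
import threading


def matmul_rows(a, b, c, row_start, row_end):
    """Compute rows [row_start, row_end) of c = a * b (hypothetical helper)."""
    inner = len(b)
    cols = len(b[0])
    for i in range(row_start, row_end):
        for j in range(cols):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))


def parallel_matmul(a, b, num_threads=4):
    """Split the rows of the result among up to num_threads threads."""
    rows = len(a)
    c = [[0] * len(b[0]) for _ in range(rows)]        # shared result matrix
    chunk = (rows + num_threads - 1) // num_threads   # rows per thread, rounded up
    threads = []
    for t in range(num_threads):
        start = t * chunk
        end = min(start + chunk, rows)
        if start >= end:
            break                                     # more threads than rows
        th = threading.Thread(target=matmul_rows, args=(a, b, c, start, end))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()                                     # wait for every partition
    return c


product = parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_threads=2)
print(product)  # [[19, 22], [43, 50]]
```

In standard CPython builds the global interpreter lock keeps these threads from executing Python bytecode at the same instant, so the sketch shows the partitioning and the shared-variable communication rather than a real speedup.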









## Parallel Processing

- It's difficult!
- It's inevitable
  - Only path to increase performance
  - Only path to lower energy consumption (improve battery life)
- In mobile systems (e.g., smart phones, tablets)
  - Multiple cores
  - Dedicated processors, e.g.,
    - Motion processor, image processor, neural processor in iPhone 8 + X
    - GPU (graphics processing unit)
- Warehouse-scale computers (next week!)
  - Multiple "nodes"
    - "Boxes" with several CPUs, disks per box
  - MIMD (multi-core) and SIMD (e.g. AVX) in each node







### Potential Parallel Performance

### (assuming software can use it)





| …    | Total, e.g., FLOPs/Cycle |
|------|--------------------------|
| 256  | 4                        |
| 512  | 8                        |
| 768  | 12                       |
| …024 | 16                       |
| …560 | 40                       |
| …072 | 48                       |
| …168 | 112                      |
| …192 | 128                      |
| …432 | 288                      |
| …480 | 320                      |

Annotations on the table: "MIMD & SIMD", "20X"



# Threads

# Programs Running on a typical Computer

| PID | TTY | TIME    | CMD |
|-----|-----|---------|-----|
| 220 | ??  | 0:04.34 | /usr/libexec/UserEventAgent (Aqua) |
| 222 | ??  | 0:10.60 | /usr/sbin/distnoted agent |
| 224 | ??  | 0:09.11 | /usr/sbin/cfprefsd agent |
| 229 | ??  | 0:04.71 | /usr/sbin/usernoted |
| 230 | ??  | 0:02.35 | /usr/libexec/nsurlsessiond |
| 232 | ??  | 0:28.68 | /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAge… |
| 234 | ??  | 0:04.36 | /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed |
| 235 | ??  | 0:01.90 | /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd |
| 236 | ??  | 0:49.72 | /usr/libexec/secinitd |
| 239 | ??  | 0:01.66 | /System/Library/PrivateFrameworks/… |
| 240 | ??  | 0:12.68 | /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd |
| 241 | ??  | 0:09.56 | /usr/libexec/SafariCloudHistoryPus… |
| 242 | ??  | 0:00.27 | /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHe… |
| 243 | ??  | 0:00.74 | /System/Library/CoreServices/mapspushd |
| 244 | ??  | 0:00.79 | /usr/libexec/fmfd |
| 246 | ??  |         | …AskPermission.framework/Versions/A/Resources/as… |
| 248 | ??  |         | …CloudDocsDaemon.framework/Versions/A/Support/b… |
| 249 | ??  |         | …IDS.framework/identityservicesd.app/Contents/Ma… |
| 250 | ??  | 0:04.81 | /usr/libexec/secd |
| 254 | ??  | 0:24.01 | /System/Library/PrivateFrameworks/… |
| 258 | ??  |         | …TelephonyUtilities.framework/callservicesd |
| 267 | ??  |         | …ayUIAgent.app/Contents/MacOS/AirPlayUIAgent |
| 271 | ??  | 0:03.91 | /usr/libexec/nsurlstoraged |
| 274 | ??  |         | …CommerceKit.framework/Versions/A/Resources/stor… |
| 282 | ??  | 0:00.09 | /usr/sbin/pboard |
| 283 | ??  |         | …InternetAccounts.framework/Versions/A/XPCServic…ternetaccounts.xpc/Contents/MacOS/com.apple.int… |
| 285 | ??  |         | …tionServices.framework/Frameworks/ATS.framework… |
| 291 | ??  |         | …y.framework/Versions/A/Resources/CloudKeychain…OS/CloudKeychainProxy |
| 292 | ??  |         | …ervicesUIAgent.app/Contents/MacOS/CoreServicesU… |
| 293 | ??  |         | …CloudPhotoServices.framework/Versions/A/Framework…vices/com.apple.CloudPhotosConfiguration.xpc/Co…oudPhotosConfiguration |
| 297 | ??  |         | …CloudServices.framework/Resources/com.apple.sbo… |
| 302 | ??  | 0:26.11 | /System/Library/CoreServices/Dock.… |
| 303 | ??  | 0:09.55 | /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer |

### ... 156 total at this moment... How does my laptop do this? Imagine doing 156 assignments all at the same time!








- A *thread* (short for "thread of execution") is a single stream of instructions
  - A program / process can split, or *fork*, itself into separate threads, which can (in theory) execute simultaneously
  - An easy way to describe/think about parallelism
- Even with a single core, a CPU can execute many threads by *time sharing*



[Figure: CPU execution over time, shared among Thread<sub>0</sub>, Thread<sub>1</sub>, and Thread<sub>2</sub>]





- Sequential flow of instructions that performs some task
  - Up to now we just called this a "program"
- Each thread has:
  - Dedicated PC (program counter)
  - Separate registers
  - Accesses the shared memory
- Each physical core provides one (or more) *hardware threads* that actively execute instructions
  - Each hardware thread executes one instruction stream at a time
- The operating system multiplexes multiple *software* threads onto the available hardware threads
  - All threads except those mapped to hardware threads are waiting
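The split between per-thread state and shared memory can be seen in a small Python sketch. The names `worker` and `results` are made up for this illustration. Each thread's local variables play the role of its private registers, while the `results` dictionary lives in memory that every thread can reach.

```python
import threading

results = {}  # shared memory: visible to every thread in the process


def worker(name, n):
    """One sequential stream of instructions, run by its own thread."""
    total = 0                 # local variable: private to this thread
    for i in range(n):
        total += i
    results[name] = total     # each thread writes its own, distinct key


threads = [
    threading.Thread(target=worker, args=(f"thread{i}", 1000 * (i + 1)))
    for i in range(3)
]
for t in threads:
    t.start()   # fork: the new thread begins executing worker()
for t in threads:
    t.join()    # wait until that thread has finished

print(sorted(results.items()))
# [('thread0', 499500), ('thread1', 1999000), ('thread2', 4498500)]
```

Because every thread writes a different key, the final contents of `results` do not depend on the order in which the operating system happens to schedule the three threads.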






# Thoughts about Threads

"Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly non-deterministic, and the job of the programmer becomes one of pruning that nondeterminism."

— The Problem with Threads, Edward A. Lee, UC Berkeley, 2006










Give illusion of many "simultaneously" active threads

1. Multiplex software threads onto hardware threads:
   - a) Switch out blocked threads (e.g., cache miss, user input, network access)
   - b) Timer (e.g., switch active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   - a) Interrupting its execution
   - b) Saving its registers and PC to memory
3. Start executing a different software thread by
   - a) Loading its previously saved registers into a hardware thread's registers
   - b) Jumping to its saved PC
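The save/restore steps above can be mimicked with a toy round-robin simulation. Everything here is an assumption made for illustration: `SoftwareThread`, `HardwareThread`, and `run` are hypothetical names, a "program" is just a list of numbers added to one register `x1`, and the timer fires after a fixed number of instructions. A real OS does this with interrupts and actual register files.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SoftwareThread:
    name: str
    program: list                      # toy instructions: amounts to add to x1
    saved_pc: int = 0                  # PC saved in memory while switched out
    saved_regs: dict = field(default_factory=lambda: {"x1": 0})


class HardwareThread:
    """One hardware thread: a PC plus a register file."""

    def __init__(self):
        self.pc = 0
        self.regs = {}

    def load(self, t):
        # Step 3: load the saved registers, then jump to the saved PC.
        self.regs = dict(t.saved_regs)
        self.pc = t.saved_pc

    def save(self, t):
        # Step 2: save registers and PC to memory.
        t.saved_regs = dict(self.regs)
        t.saved_pc = self.pc


def run(threads, time_slice=2):
    """Multiplex software threads onto a single hardware thread."""
    hw = HardwareThread()
    ready = deque(threads)
    trace = []                         # which thread executed each instruction
    while ready:
        t = ready.popleft()
        hw.load(t)
        for _ in range(time_slice):    # timer: switch after time_slice instructions
            if hw.pc >= len(t.program):
                break
            hw.regs["x1"] += t.program[hw.pc]   # "execute" one instruction
            hw.pc += 1
            trace.append(t.name)
        hw.save(t)
        if t.saved_pc < len(t.program):
            ready.append(t)            # unfinished: back of the ready queue
    return trace


a = SoftwareThread("A", [1, 1, 1])
b = SoftwareThread("B", [10, 10])
trace = run([a, b])
print(trace)                                      # ['A', 'A', 'B', 'B', 'A']
print(a.saved_regs["x1"], b.saved_regs["x1"])     # 3 20
```

Thread A resumes with `x1 == 2` after B has run, which is the point of saving and restoring state: each software thread sees its own registers even though only one hardware register file exists.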







### **Example: Four Cores**



### Each "Core" actively runs one instruction stream at a time








# Multithreading





## Multithreading

- Typical scenario:
  - Active thread encounters cache miss
  - Active thread waits $\sim$ 1000 cycles for data from DRAM
  - $\rightarrow$ switch out and run different thread until data available
- Problem
  - Must save current thread state and load new thread state
    - PC, all registers (could be many, e.g. AVX)
  - $\rightarrow$  must perform switch in  $\ll$  1000 cycles
- Can hardware help?
  - Moore's Law: transistors are plentiful








# Hardware Assisted Software Multithreading

[Figure: Processor (1 core, 2 threads) containing Control and a Datapath with PC 0, PC 1, Registers 0, Registers 1, and an ALU, connected to Memory (bytes)]

- Two copies of PC and Registers inside processor hardware
- Looks identical to two processors to software (hardware thread 0, hardware thread 1)
- Hyper-Threading:
  - Both threads can be active simultaneously








## Hyper-Threading



- Simultaneous Multithreading (HT): Logical CPUs > Physical CPUs
  - Run multiple threads at the same time per core
  - Each thread has own architectural state (PC, Registers, etc.)
  - Share resources (cache, instruction unit, execution units)
  - See http://dada.cs.washington.edu/smt/








## Multithreading

- Logical threads
  - $\approx 1\%$ more hardware
  - $\approx 10\%$ (?) better performance
    - Separate registers
    - Share datapath, ALU(s), caches
- Multicore
  - Duplicate processors
  - $\approx 50\%$ more hardware
  - $\approx 2\times$ better performance?
- Modern machines do both
  - Multiple cores with multiple threads per core








## Dan's Laptop (cf Activity Monitor)

### \$ sysctl hw

- hw.physicalcpu: 4
- hw.logicalcpu: 8

4 Cores, 8 Threads total
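A program can ask for the logical CPU count at run time. The sketch below uses Python's standard `os.cpu_count()`, which reports logical CPUs (hardware threads), so on the laptop above it would correspond to `hw.logicalcpu` rather than `hw.physicalcpu`. The variable name `logical` is chosen here for illustration.

```python
import os

# Number of logical CPUs (hardware threads) visible to the OS,
# or None if the count cannot be determined.
logical = os.cpu_count()
print(f"logical CPUs (hardware threads): {logical}")
```

The physical core count needs a platform-specific query, such as the `sysctl hw` command shown above on macOS.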





### Intel® Xeon® W-3275M Processor

### **Technical Specifications**

Essentials

| Field              | Value                    |
|--------------------|--------------------------|
| Vertical Segment   | Workstation              |
| Product Collection | Intel® Xeon® W Processor |
| Processor Number   | W-3275M                  |
| Status             | Launched                 |
| Launch Date        | Q2'19                    |
| Lithography        | 14 nm                    |

Performance

| Field                                           | Value    |
|-------------------------------------------------|----------|
| # of Cores                                      | 28       |
| # of Threads                                    | 56       |
| Processor Base Frequency                        | 2.50 GHz |
| Max Turbo Frequency                             | 4.40 GHz |
| Cache                                           | 38.5 MB  |
| Bus Speed                                       | 8 GT/s   |
| Intel® Turbo Boost Max Technology 3.0 Frequency | 4.60 GHz |
| TDP                                             | 205 W    |

Source: https://www.intel.com/content/www/us/en/products/processors/xeon…processors/w-3275m.html

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload. Refer to Datasheet for thermal solution requirements.





### Example: 6 Cores, 24 Logical Threads

| TIME    | CMD                                        |
|---------|--------------------------------------------|
| 0:04.34 | /usr/libexec/UserEventAgent (Aqua)         |
| 0:10.60 | /usr/sbin/distnoted agent                  |
| 0:09.11 | /usr/sbin/cfprefsd agent                   |
| 0:04.71 | /usr/sbin/usernoted                        |
| 0:02.35 | /usr/libexec/nsurlsessiond                 |
| 0:28.68 | /System/Library/PrivateFrameworks/Calend…  |
| 0:04.36 | /System/Library/PrivateFrameworks/GameCe…  |
| 0:01.90 | /System/Library/CoreServices/cloudphotos…  |
| 0:49.72 | /usr/libexec/secinitd                      |
| 0:01.66 | /System/Library/PrivateFrameworks/TCC.fr…  |
| 0:12.68 | /System/Library/Frameworks/Accounts.fram…  |
| 0:09.56 | /usr/libexec/SafariCloudHistoryPushAgent   |
| 0:00.27 | /System/Library/PrivateFrameworks/CallHi…  |
| 0:00.74 | /System/Library/CoreServices/mapspushd     |
| 0:00.79 | /usr/libexec/fmfd                          |

<u>Thread pool</u>: List of threads competing for processor

OS maps threads to cores and schedules logical (software) threads

| Core 1   | Core 2   | Core 3   | Core 4   | Core 5   | Core 6   |
|----------|----------|----------|----------|----------|----------|
| Thread 1 | Thread 1 | Thread 1 | Thread 1 | Thread 1 | Thread 1 |
| Thread 2 | Thread 2 | Thread 2 | Thread 2 | Thread 2 | Thread 2 |
| Thread 3 | Thread 3 | Thread 3 | Thread 3 | Thread 3 | Thread 3 |
| Thread 4 | Thread 4 | Thread 4 | Thread 4 | Thread 4 | Thread 4 |

4 Logical threads per core (hardware) thread





### **Review:** Definitions

### Thread Level Parallelism

- Thread: sequence of instructions, with own program counter and processor state (e.g., register file)
- Multicore:
  - Physical CPU: one thread (at a time) per CPU; the OS switches threads in software, typically in response to I/O events like disk read/write
  - Logical CPU: Fine-grain thread switching, in hardware, when thread blocks due to cache miss/memory access
  - Hyper-Threading aka Simultaneous Multithreading (SMT): Exploit superscalar architecture to launch instructions from different threads at the same time!








## And, in Conclusion, ...

- Sequential software execution speed is limited
  - Clock rates flat or declining
- Parallelism the only path to higher performance
  - SIMD: instruction level parallelism
    - Implemented in all high perf. CPUs today (x86, ARM, ...)
    - Partially supported by compilers
    - 2X width every 3-4 years
  - MIMD: thread level parallelism
    - Multicore processors
    - Supported by Operating Systems (OS)
    - Requires programmer intervention to exploit at single program level (we see later)
    - Add 2 cores every 2 years (2, 4, 6, 8, 10, ...)
      - Intel Xeon W-3275: 28 Cores, 56 Threads
  - SIMD & MIMD for maximum performance
- Key challenge: craft parallel programs that keep high performance on multiprocessors as the number of processors increases, i.e., programs that scale
  - Scheduling, load balancing, time for synchronization, communication overhead
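How far a program can scale is bounded by Amdahl's Law, named earlier as the source of the shared-memory bottleneck. The symbols below are chosen here for illustration: $p$ is the fraction of execution time that can be parallelized and $n$ is the number of processors.

$$
\text{Speedup}(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
$$

As a worked example under the assumed value $p = 0.9$, the 28 cores of the Xeon W-3275 give

$$
\text{Speedup}(28) = \frac{1}{0.1 + \dfrac{0.9}{28}} \approx 7.6,
$$

and no number of cores can exceed $1 / (1 - p) = 10$. The serial fraction, which includes the synchronization and communication costs listed above, sets the ceiling.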





