What is the significance of the phrase “Assume that the Writeback stage writes to the register file (RF) in the first half of a cycle, while the Decode stage reads from the register file in the second half of a cycle” from Section 3? Give an example of how the register file access order can affect the number of RAW stall cycles.
The order in which the register file is accessed by pipeline stages affects the number of RAW stall cycles that can occur in a pipeline.
If the Writeback stage writes to the register file in the second half of a cycle, while the Decode stage reads from the register file in the first half of a cycle. In this case, if an instruction in the Decode stage requires data that has been modified by an instruction in the Writeback stage, a RAW hazard will occur, and a stall cycle will be inserted to resolve the dependency.
Are WAW hazards possible in a simple 5-stage pipeline? Explain your answer.
WAW hazards are not possible because all instructions modifying the register file reach the WB stage in process order.
What is the difference between a data hazard and a structural hazard?
Data hazards refers to situations where instructions in a pipeline exhibit data dependence and executing different stages of the pipeline may break that dependence, such as RAW, WAW, WAR.
Structural hazard occurs when two or more instructions in a pipeline require the same hardware resource, such as a memory or a register resources.
What is the difference between a dependence and a hazard?
Dependence refers to a relationship between two instructions, where one instruction needs the result of the other instruction to execute.
Hazard refers to a situation where the pipeline execution of instructions is disrupted due to
Provide a high-level algorithmic solution for each question of Section 3, along with pipeline diagrams. Think of a series of instructions (e.g., i1
to ix
), and specify the necessary conditions for a hazard to occur (e.g., data dependence, distance of instructions in the pipeline). Provide examples for all possible stall scenarios; for example, the conditions will be different for a one-cycle stall and a two-cycles stall. We provide empty pipeline diagrams in the appendix. Feel free to print multiple copies of that page and fill-in the tables as needed. This is an important step. Not only will it help you brainstorm, but it will also help the TA identify any issues with your logic.
Q1: What is the slowdown in performance due to read-after-write (RAW) hazards considering our 5-stage pipeline with NO forwarding or bypassing? Assume an ideal memory system and ignore control hazards.
Cycles | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Instr. | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 |
LW R1 (0)R2 |
F | D | EX | M | WB | |||||
ADD R3, R1, #1 |
F | D | * | * | EX | M | WB | |||
ADD R4, R1, #1 |
F | * | * | D | EX | M | WB | |||
ADD R5, R1, #1 |
F | D | EX | M | WB | |||||
ADD R6, R1, #1 |
F | D | EX | M |
It’s possible to have 2 cycles of stall when D stage need register which is the result in previous calculations or load/store.
Cycles | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Instr. | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 |
LW R1 (0)R2 |
F | D | EX | M | WB | |||||
nop |
F | D | EX | M | WB | |||||
ADD R4, R1, #1 |
F | D | * | EX | M | WB | ||||
ADD R5, R1, #1 |
F | * | D | EX | M | WB | ||||
ADD R6, R1, #1 |
F | D | EX | M | WB |
It’s possible to have 1 cycle stall when D stage need register which the the result of the calculations or load/store two cycles before
Q2: Now, consider a 6-stage pipeline where Execute is broken into 2 stages, EX1 and EX2, making the full pipeline: F, D, EX1, EX2, M, WB. This design has full forwarding and bypassing. Any inputs needed by Execute must be available at the start of EX1. Both stages (EX1 and EX2) are needed to compute the final result. What is the slowdown in performance due to read-after-write (RAW) hazards for this design? Assume an ideal memory system and ignore control hazards.
Cycles | |||||||||
---|---|---|---|---|---|---|---|---|---|
Instr. | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
LW R1 (0)R2 |
F | D | EX1 | EX2 | M | WB | |||
ADD R3, R1, #1 |
F | D | * | * | EX1 | EX2 | M | WB | |
ADD R4, R1, #1 |
F | * | * | D | EX1 | EX2 | M |
It’s possible to have 2 cycles of stall when LW
instruction is followed by a instruction which uses the register LW
is loaded into.
However, if the next instruction of LW
is a SW
that store the exact register LW
is loading into, then the processor doesn’t have to stall.
Cycles | |||||||||
---|---|---|---|---|---|---|---|---|---|
Instr. | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
LW R1 (0)R2 |
F | D | EX1 | EX2 | M | WB | |||
SW R1, (0)R3 |
F | D | EX1 | EX2 | M | WB |
Cycles | |||||||||
---|---|---|---|---|---|---|---|---|---|
Instr. | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
LW R1 (0)R2 |
F | D | EX1 | EX2 | M | WB | |||
nop |
F | D | EX1 | EX2 | M | WB | |||
ADD R4, R1, #1 |
F | D | * | EX1 | EX2 | M | WB |
It’s possible to have 2 cycles of stall when LW
instruction is followed by a nop instruction and a instruction which uses the register LW
is loaded into, i.e. one immediate instruction between two instruction.
Cycles | |||||||||
---|---|---|---|---|---|---|---|---|---|
Instr. | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
ADD R1, R2, #1 |
F | D | EX1 | EX2 | M | WB | |||
ADD R3, R1, #1 |
F | D | * | EX1 | EX2 | M | WB | ||
ADD R4, R1, #1 |
F | * | D | EX1 | EX2 | M | WB |
It’s possible to have 1 cycles of stall when other arithmetic instruction is followed by a instruction which uses the result from the ALU unit
Why do we use predictors with 2-bit saturating counters (i.e., bimodal)? Contrast this with the 1-bit approach.
The basic idea with 2-bit saturating counters is to change the prediction after two consecutive mispredictions rather than after every misprediction, as in the 1-bit predictor.
The main contribution of 2-bit saturating counter is shown in the following nested loop.
for (i=0 ; i<100; i++)
for (j=0; j<3; j++)
...
When execution exits the inner loop, the prediction remains “taken,” (go into the for loop) and the branch is predicted correctly when the inner loop is re-entered from the outer loop. Whereas 1 bit approach will predict the inner loop as “not taken” (skip for loop) when the inner loop is entered again.
You are given the following code snippet, with two conditional branches clearly marked:
int a;
for (int i = 0; i < 100000; i++) { // conditional branch B1
if ((i % 4) == 0) { // conditional branch B2
a = 10;
a = 15;
}
}
Write down the taken/not-taken sequence for the second conditional branch B2. List any assumptions you make about how this if-statement translates in assembly.
Time | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
T/N | N | T | T | T | N | T | T | T | N |
The following is the assembly code compiled using ssbig-na-sstrix-gcc
using flag -O0
$L5:
lw $2,20($fp)
andi $3,$2,0x0003
bne $3,$0,$L6
li $2,0x0000000a # 10
sw $2,16($fp)
li $2,0x0000000f # 15
sw $2,16($fp)
$L6:
$L4:
lw $3,20($fp)
addu $2,$3,1
move $3,$2
sw $3,20($fp)
j $L2
The condition (i % 4) == 0
is implemented using andi $3,$2,0x0003
. i
is loaded from address 20($fp)
and is ANDed with 0x3
.
If the result is 0, $3
will be equal to $0
($0
is guaranteed to be 0 in PISA and MIPS). Thus the branch is not taken and execution will go into the if
statement.
If the result is not 0, then $3
will not be equal to $0
. Thus the branch is taken and execution will skip the if
statement.
Assume you are keeping track of prior branch history for conditional branch B2. What is the smallest number of history bits ($h$) you need to correctly predict all B2’s branch outcomes?
The smallest number of history bit is 3. The following is how the history table predict conditional branch.
History Bits | Prediction |
---|---|
TTT | N |
NTT | T |
TNT | T |
TTN | T |
Assume you have a PAg predictor (Figure 3.20) with an eight-entries BHT and $h$ history bits per entry ($h$ is the previously computed number). Why would it be a bad idea to index the BHT with the 3 most significant PC bits?