Lab 1

  1. What is the significance of the phrase “Assume that the Writeback stage writes to the register file (RF) in the first half of a cycle, while the Decode stage reads from the register file in the second half of a cycle” from Section 3? Give an example of how the register file access order can affect the number of RAW stall cycles.

    The order in which the register file is accessed by pipeline stages affects the number of RAW stall cycles that can occur in a pipeline.

    With this convention, an instruction in the Decode stage can read a register in the second half of the same cycle in which an instruction in the Writeback stage writes it in the first half. In a pipeline with no forwarding, a consumer that is three instructions behind the producer therefore needs no stall: the producer writes the register in WB while the consumer reads it in D during the same cycle. If instead the write happened in the second half of the cycle (or the read in the first half), that same consumer would have to stall one more cycle and read the register in the following cycle, so every RAW dependence would cost one additional stall cycle.
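
    For example, a minimal diagram in the style of the tables below (my own example, assuming no forwarding and two independent instructions between the producer and the consumer):

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8
    ADD R1, R2, R3    F   D   EX  M   WB
    (independent)         F   D   EX  M   WB
    (independent)             F   D   EX  M   WB
    SUB R4, R1, R5                F   D   EX  M   WB

    With split-cycle access, SUB reads R1 in the second half of c5, right after ADD writes it in the first half, so no stall is needed; with the opposite order, SUB would have to stall one cycle and read R1 in c6.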

  2. Are WAW hazards possible in a simple 5-stage pipeline? Explain your answer.

    WAW hazards are not possible in a simple 5-stage pipeline: instructions flow in program order through a single write stage (WB), so all register-file writes happen in program order and a later instruction can never write a register before an earlier instruction that also writes it.

  3. What is the difference between a data hazard and a structural hazard?

    A data hazard arises when two instructions in the pipeline have a data dependence (RAW, WAW, or WAR) and overlapping their execution could violate that dependence, producing an incorrect result unless the pipeline stalls or forwards.

    A structural hazard occurs when two or more instructions need the same hardware resource in the same cycle, for example a single memory port or a single register-file write port.
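
    A minimal diagram illustrating this (my own example, assuming a single-ported unified memory shared by the Fetch and Memory stages, unlike the ideal memory assumed elsewhere in this lab):

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9
    LW  R1, 0(R2)     F   D   EX  M   WB
    ADD R3, R4, R5        F   D   EX  M   WB
    ADD R6, R7, R8            F   D   EX  M   WB
    SUB R9, R7, R8                *   F   D   EX  M   WB

    In c4 the LW needs the memory for its M stage while the SUB needs it for its F stage; only one can use the single port, so the fetch is delayed by one cycle.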

  4. What is the difference between a dependence and a hazard?

    A dependence is a property of the program itself: a relationship between two instructions in which one instruction produces a value that the other uses (or both access the same storage location), regardless of how the instructions are executed.

    A hazard is a property of a particular pipeline implementation: a situation in which overlapping the execution of instructions could produce an incorrect result or forces the pipeline to stall, due to

    1. Data dependences (data hazards: RAW, WAW, WAR)
    2. Conditional branches whose outcome is not known early enough (in our pipeline they are resolved in the EX stage and predicted not taken) (control hazards)
    3. Conflicts for hardware resources (structural hazards)

    A dependence only becomes a hazard when the instructions are close enough together in the pipeline; for example, a RAW dependence between two instructions that are five instructions apart causes no hazard in the 5-stage pipeline.

  5. Provide a high-level algorithmic solution for each question of Section 3, along with pipeline diagrams. Think of a series of instructions (e.g., i1 to ix), and specify the necessary conditions for a hazard to occur (e.g., data dependence, distance of instructions in the pipeline). Provide examples for all possible stall scenarios; for example, the conditions will be different for a one-cycle stall and a two-cycle stall. We provide empty pipeline diagrams in the appendix. Feel free to print multiple copies of that page and fill in the tables as needed. This is an important step. Not only will it help you brainstorm, but it will also help the TA identify any issues with your logic.

    Q1: What is the slowdown in performance due to read-after-write (RAW) hazards considering our 5-stage pipeline with NO forwarding or bypassing? Assume an ideal memory system and ignore control hazards.

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
    LW  R1, 0(R2)     F   D   EX  M   WB
    ADD R3, R1, #1        F   D   *   *   EX  M   WB
    ADD R4, R1, #1            F   *   *   D   EX  M   WB
    ADD R5, R1, #1                        F   D   EX  M   WB
    ADD R6, R1, #1                            F   D   EX  M

    A two-cycle stall occurs when the instruction in the Decode stage needs a register produced by the instruction immediately before it (an arithmetic result or a loaded value): the consumer must wait in Decode until the producer writes the register in WB.

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
    LW  R1, 0(R2)     F   D   EX  M   WB
    nop                   F   D   EX  M   WB
    ADD R4, R1, #1            F   D   *   EX  M   WB
    ADD R5, R1, #1                F   *   D   EX  M   WB
    ADD R6, R1, #1                        F   D   EX  M   WB

    A one-cycle stall occurs when the instruction in the Decode stage needs a register produced by an instruction two ahead of it in program order (one unrelated instruction in between), and no stall is needed when the producer is three or more instructions ahead.
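
    Putting these cases together (my own summary, under the stated assumptions of an ideal memory system and no control hazards): the ideal CPI is 1, so slowdown = CPI_with_stalls / CPI_ideal = 1 + (average RAW stall cycles per instruction), where a dependent instruction contributes 2, 1, or 0 stall cycles depending on whether it is one, two, or at least three instructions behind its producer. The actual slowdown therefore depends on how often these dependence distances occur in the measured program.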

    Q2: Now, consider a 6-stage pipeline where Execute is broken into 2 stages, EX1 and EX2, making the full pipeline: F, D, EX1, EX2, M, WB. This design has full forwarding and bypassing. Any inputs needed by Execute must be available at the start of EX1. Both stages (EX1 and EX2) are needed to compute the final result. What is the slowdown in performance due to read-after-write (RAW) hazards for this design? Assume an ideal memory system and ignore control hazards.

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9
    LW  R1, 0(R2)     F   D   EX1 EX2 M   WB
    ADD R3, R1, #1        F   D   *   *   EX1 EX2 M   WB
    ADD R4, R1, #1            F   *   *   D   EX1 EX2 M

    A two-cycle stall occurs when a LW is immediately followed by an instruction that uses the loaded register: the value is only available at the end of the Memory stage and the consumer needs it at the start of EX1, so the consumer's EX1 must be delayed until the cycle after the LW's Memory stage.

    However, if the instruction after the LW is a SW that stores the very register the LW loads, the processor does not have to stall: the store needs that value only in its Memory stage, so it can be forwarded directly from the LW's Memory stage.

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9
    LW  R1, 0(R2)     F   D   EX1 EX2 M   WB
    SW  R1, 0(R3)         F   D   EX1 EX2 M   WB

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9
    LW  R1, 0(R2)     F   D   EX1 EX2 M   WB
    nop                   F   D   EX1 EX2 M   WB
    ADD R4, R1, #1            F   D   *   EX1 EX2 M   WB

    A one-cycle stall occurs when exactly one independent instruction (here a nop) sits between the LW and the instruction that uses the loaded register; with two or more independent instructions in between, no stall is needed.

    Cycles
    Instr.            c1  c2  c3  c4  c5  c6  c7  c8  c9
    ADD R1, R2, #1    F   D   EX1 EX2 M   WB
    ADD R3, R1, #1        F   D   *   EX1 EX2 M   WB
    ADD R4, R1, #1            F   *   D   EX1 EX2 M   WB

    A one-cycle stall occurs when an arithmetic instruction is immediately followed by an instruction that uses its result: the ALU result is only available at the end of EX2, so it must be forwarded back to the start of the consumer's EX1 one cycle later than back-to-back issue would allow.
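
    As in Q1 (again my own summary under the stated assumptions), the slowdown is 1 + (average stall cycles per instruction), but with full forwarding only two cases contribute: a consumer one or two instructions behind a load stalls for 2 or 1 cycles respectively, and a consumer immediately behind an ALU instruction stalls for 1 cycle; every other RAW distance needs no stall.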

Lab 2

  1. Why do we use predictors with 2-bit saturating counters (i.e., bimodal)? Contrast this with the 1-bit approach.

    The basic idea of the 2-bit saturating counter is to change the prediction only after two consecutive mispredictions, rather than after every misprediction as the 1-bit predictor does.

    The benefit of the 2-bit saturating counter is illustrated by the following nested loop.

    for (i=0 ; i<100; i++)
        for (j=0; j<3; j++)
            ...
    

    When execution exits the inner loop, the 2-bit counter only moves from strongly taken to weakly taken, so the prediction remains “taken” (enter the loop body) and the inner-loop branch is predicted correctly when it is re-entered from the outer loop. The 1-bit predictor, by contrast, flips to “not taken” (skip the loop body) after the single exit misprediction and therefore mispredicts again the first time the inner loop is re-entered, costing two mispredictions per outer iteration instead of one.
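
    As a concrete illustration, here is a minimal C sketch of my own (the counter encoding and initial states are assumptions, not part of the lab) contrasting the two update rules on the inner-loop branch above:

    #include <stdio.h>
    #include <stdbool.h>

    /* 1-bit predictor: remember the last outcome, flip on every misprediction. */
    typedef bool pred1_t;

    static bool pred1_predict(pred1_t p)              { return p; }
    static void pred1_update(pred1_t *p, bool taken)  { *p = taken; }

    /* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
     * Two consecutive mispredictions are needed before the prediction flips. */
    typedef unsigned pred2_t;

    static bool pred2_predict(pred2_t c)              { return c >= 2; }
    static void pred2_update(pred2_t *c, bool taken)
    {
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }

    int main(void)
    {
        pred1_t p1 = true;   /* both predictors start out predicting "taken" */
        pred2_t p2 = 3;
        int m1 = 0, m2 = 0;

        /* Model the inner-loop branch above: taken three times, then not
         * taken once at loop exit, repeated for every outer iteration.   */
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 4; j++) {
                bool taken = (j < 3);
                m1 += (pred1_predict(p1) != taken);
                m2 += (pred2_predict(p2) != taken);
                pred1_update(&p1, taken);
                pred2_update(&p2, taken);
            }
        }

        printf("1-bit mispredictions: %d\n", m1);  /* roughly 2 per outer iteration */
        printf("2-bit mispredictions: %d\n", m2);  /* 1 per outer iteration */
        return 0;
    }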

  2. You are given the following code snippet, with two conditional branches clearly marked:

    int a;
    
    for (int i = 0; i < 100000; i++) {  // conditional branch B1
        if ((i % 4) == 0) {  // conditional branch B2
            a = 10;
            a = 15;
        }
    }
    

    Write down the taken/not-taken sequence for the second conditional branch B2. List any assumptions you make about how this if-statement translates in assembly.

    Time  1  2  3  4  5  6  7  8  9
    T/N   N  T  T  T  N  T  T  T  N

    The following is the assembly code produced by ssbig-na-sstrix-gcc with the -O0 flag:

    $L5:
    	lw	$2,20($fp)
    	andi	$3,$2,0x0003
    	bne	$3,$0,$L6
    	li	$2,0x0000000a		# 10
    	sw	$2,16($fp)
    	li	$2,0x0000000f		# 15
    	sw	$2,16($fp)
    $L6:
    $L4:
    	lw	$3,20($fp)
    	addu	$2,$3,1
    	move	$3,$2
    	sw	$3,20($fp)
    	j	$L2
    

    The condition (i % 4) == 0 is implemented using andi $3,$2,0x0003. i is loaded from address 20($fp) and is ANDed with 0x3.

    If the result is 0, $3 will be equal to $0 ($0 is guaranteed to be 0 in PISA and MIPS). Thus the branch is not taken and execution will go into the if statement.

    If the result is not 0, then $3 will not be equal to $0. Thus the branch is taken and execution will skip the if statement.
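
    A small C sketch of my own (not part of the lab) that mirrors the bne semantics above and reproduces the sequence in the table:

    #include <stdio.h>

    int main(void)
    {
        /* B2 is "bne $3,$0,$L6" with $3 = i & 0x3, so B2 is Taken when
         * (i % 4) != 0 and Not taken when (i % 4) == 0.                 */
        for (int i = 0; i < 9; i++)
            putchar(((i & 0x3) != 0) ? 'T' : 'N');
        putchar('\n');   /* prints NTTTNTTTN, matching the table above */
        return 0;
    }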

    Assume you are keeping track of prior branch history for conditional branch B2. What is the smallest number of history bits ($h$) you need to correctly predict all B2’s branch outcomes?

    The smallest number of history bits is $h = 3$. With only 2 history bits, the history TT is followed by T at some points of the pattern and by N at others, so it cannot always be predicted correctly; with 3 bits, every history that actually occurs maps to a unique next outcome. The following table shows, for each history of the last three outcomes, the prediction for the next outcome of B2.

    History Bits   Prediction
    TTT            N
    NTT            T
    TNT            T
    TTN            T
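
    To sanity-check that $h = 3$ suffices, here is a minimal C sketch of my own (an idealized local-history predictor that simply remembers the outcome last seen after each history, not the PAg simulator used in the lab); after a short warm-up it predicts every B2 outcome correctly:

    #include <stdio.h>
    #include <stdbool.h>

    #define H        3           /* history bits (h = 3)            */
    #define PATTERNS (1 << H)    /* 8 possible 3-outcome histories  */

    int main(void)
    {
        unsigned history = 0;                /* last 3 outcomes of B2, 1 = taken       */
        bool pattern[PATTERNS] = { false };  /* outcome last seen after each history   */
        bool seen[PATTERNS]    = { false };
        int  mispredict = 0;

        for (int i = 0; i < 100000; i++) {
            bool taken = (i % 4) != 0;                 /* actual outcome of B2      */
            bool pred  = seen[history] ? pattern[history] : false;

            if (pred != taken)
                mispredict++;

            pattern[history] = taken;                  /* learn: history -> outcome */
            seen[history]    = true;
            history = ((history << 1) | taken) & (PATTERNS - 1);
        }

        /* Only a handful of warm-up branches mispredict; afterwards the four
         * recurring histories (TTT, TTN, TNT, NTT) each map to a fixed outcome. */
        printf("mispredictions: %d out of 100000\n", mispredict);
        return 0;
    }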

    Assume you have a PAg predictor (Figure 3.20) with an eight-entry BHT and $h$ history bits per entry ($h$ is the previously computed number). Why would it be a bad idea to index the BHT with the 3 most significant PC bits?

Lab 3

  1. How can the Tomasulo algorithm extract more ILP for a given program compared to an in-order processor?
  2. What are reservation stations used for? List all the fields of a reservation station entry and explain their functionality.
  3. How does Tomasulo’s algorithm deal with false-dependences? Provide a brief example.

Lab 4

  1. Would a next line-prefetcher work well for an instruction cache? When would it issue useless prefetches?
  2. Can data prefetching be harmful for the performance of a system? Provide an example.
  3. How could you address the issue of harmful prefetches, assuming you cannot turn off your data prefetcher?

Lab 5

  1. Why are transient states necessary?
  2. Why does a coherence protocol use stalls?