Lecture 6: Wide Issue and Speculation
The goal is high-performance. The means are: high IPC and high clock rates.
We get high clock rates through pipelining (as well as advances in process
technology). However, pipelining hurts IPC because of pipeline hazards.
To address this, we must find more parallelism. Even in the ideal case,
the best IPC we can hope for in single-issue processors is 1.0. We will
see that by issuing multiple instructions every clock cycle, we can exceed
that limit. We will also see that speculating across control dependences
has the potential to increase parallelism and thus IPC.
Your book calls this technique multiple issue. Put simply,
it means issuing more than one instruction in a clock cycle. There are
many flavors of wide-issue processors. The table below taxonomizes them.
It is reproduced from your book on page 115.
The main idea is to fetch, decode, issue, and hopefully execute more than one
instruction per clock cycle. In this way, we can increase IPC above 1.0.
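To make the payoff concrete, here is a back-of-the-envelope calculation for a 2-issue machine. The percentages are made up purely for illustration, not measurements of any real processor:

```python
# Back-of-the-envelope IPC for a 2-issue machine (illustrative numbers
# only). On p percent of cycles the machine finds two independent
# instructions to issue together; on the remaining cycles it issues one.
def dual_issue_ipc(p):
    return (2 * p + (100 - p)) / 100

print(dual_issue_ipc(0))    # never pairs: IPC = 1.0, the single-issue limit
print(dual_issue_ipc(60))   # pairs on 60% of cycles: IPC = 1.6
print(dual_issue_ipc(100))  # always pairs: IPC = 2.0
```

Even a machine that pairs instructions only some of the time breaks the single-issue IPC ceiling of 1.0.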
However, issuing multiple instructions per cycle adds more complexity
to the microarchitecture. It may have to be more deeply pipelined to
sustain the same clock rate as a single-issue version. All manufacturers
of general-purpose microprocessors have decided that the extra performance
is worth the extra complexity. Here are the several flavors of n-issue
processors:

  Static superscalar.
      Examples: Sun UltraSPARC II/III, embedded MIPS and ARM/Intel XScale.
  Dynamic superscalar: some out-of-order execution.
  Speculative superscalar: dynamic scheduling, out-of-order with speculation.
      Examples: Intel Pentium 4, Intel Core, MIPS R12K, Compaq Alpha EV6,
      IBM Power5.
  VLIW: no hazards between issue packets.
  EPIC: explicit dependences marked by compiler.
      Examples: Intel Itanium, Intel Itanium 2.
- Static superscalar. Up to n instructions are fetched in a
single cycle, usually stopping once n instructions or a predicted
taken branch is reached. The dependences are checked in the second part
of the issue phase, just like the single-issue case. The instructions
are scheduled in order, so the compiler has to do a good job of scheduling.
- Dynamic superscalar. Up to n instructions are fetched
in a single cycle, as before. However, they enter a dynamic scheduling
algorithm with reservation stations, etc. as in the last lecture.
- Speculative superscalar. This is dynamic superscalar with speculation,
i.e., branch predictions are speculatively acted upon.
- VLIW, for "very long instruction word." At the other extreme,
VLIW ISAs have ILP specified explicitly. "No" dependency checks are
done at run-time. At compile-time, many instructions are packed into a
single very long instruction. These long instructions are fetched one at
a time, forming an issue packet that is issued all at once. The compiler
is completely responsible for making sure that instructions within long
instructions are independent, and that instruction dependencies will all
have been satisfied. In the simplest case, a stall in any functional unit
(e.g., a cache miss) causes the entire pipeline to freeze.
- EPIC, for "explicitly parallel instruction computing." This is
Intel's term for VLIW plus some dynamic checks. The compiler is still
responsible for scheduling instructions, but there is also speculation
that can be controlled by the compiler as well as the microarchitecture.
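Since a VLIW compiler must guarantee that the instructions packed into one long instruction are mutually independent, it has to reject any bundle with a register hazard among its members. Here is a minimal sketch of that check; the 3-tuple instruction format `(op, dest, sources)` is invented for illustration:

```python
# Sketch of the independence check a VLIW compiler must perform on a
# bundle. Instruction format (op, dest, sources) is hypothetical.
# Instructions in a bundle execute in parallel, so any RAW, WAW, or WAR
# hazard between two members makes the bundle illegal.
def bundle_is_independent(bundle):
    written = set()
    read = set()
    for op, dest, srcs in bundle:
        # RAW: a source register is written by an earlier bundle member.
        if any(s in written for s in srcs):
            return False
        # WAW: two members write the same register.
        # WAR: a member overwrites a register an earlier member reads.
        if dest in written or dest in read:
            return False
        written.add(dest)
        read.update(srcs)
    return True

ok  = bundle_is_independent([("add", "r1", ["r2", "r3"]),
                             ("mul", "r4", ["r5", "r6"])])   # True
bad = bundle_is_independent([("add", "r1", ["r2", "r3"]),
                             ("mul", "r4", ["r1", "r6"])])   # False: RAW on r1
```

A real VLIW compiler does far more (latency-aware scheduling, nop insertion across functional-unit slots), but this is the core legality test that replaces the hardware's dynamic dependence checks.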
Static vs. Dynamic
These ideas can be divided into two camps. We have discussed instruction
scheduling before, but wide issue is where it becomes critically
important: we have to make sure that instructions issued in the same
cycle have all their dependences met, both dependences on the past and
dependences on each other. The two camps resolve the problem of
wide-issue instruction scheduling differently:
- Statically scheduled. Have the compiler worry about hazards, because:
- We believe the compiler can be smart about this, or
- We can't afford to do it dynamically because of constraints
related to the implementation.
- Dynamically scheduled. Have the microarchitecture worry about hazards,
because:
- We don't trust the compiler, and
- We have enough transistors.
The fact that every type of wide-issue has existing examples shows that we
as a community haven't yet decided how we want to do scheduling.
- The compiler knows enough about the past and future to do a reasonable
job of scheduling. With static scheduling, the compiler does a lot of
work to figure out the schedule once. This work is amortized over every
execution of the scheduled code, which for production systems can mean
that the scheduling is essentially free.
- On the other hand, the microarchitecture potentially knows
everything about the past, and can do a reasonable job of predicting
the future, so it can do a better job of scheduling. In particular,
the microarchitecture can deal with problems such as aliasing that are
very difficult to deal with in the compiler (sometimes undecidable).
However, now the scheduling work is being done all the time, on-line.
This seems like a great waste of effort compared with static scheduling.
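The aliasing point deserves a concrete illustration. At run time the two effective addresses are concrete numbers, so "do these accesses conflict?" reduces to a comparison; the compiler, seeing only symbolic indices, may be unable to prove they differ. A toy sketch (addresses and indices are invented for illustration):

```python
# Toy illustration of why hardware handles aliasing more easily than the
# compiler. Statically, a store to a[i] followed by a load of a[j] cannot
# be reordered unless the compiler proves i != j, which it often cannot.
# Dynamically, the hardware just compares the two computed addresses.
def can_reorder(store_addr, load_addr):
    # Safe to let the load pass the store iff the addresses differ.
    return store_addr != load_addr

base = 0x1000            # hypothetical base address of array a
i, j = 3, 7              # index values known only at run time
assert can_reorder(base + 4 * i, base + 4 * j)      # no alias: reorder OK
assert not can_reorder(base + 4 * i, base + 4 * 3)  # same address: must wait
```

Real load/store queues perform exactly this kind of address comparison, though against many in-flight accesses at once rather than one pair.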
Speculation
We have seen how branch prediction can speed up instruction fetch and
we have seen hints about how branch prediction can allow speculation.
Here, we'll go into more detail about hardware speculation. Later on,
we'll see how software can aid in speculation.
The basic idea is to treat branch predictions as if they are correct,
and speculatively execute the resulting instructions. The speculations
are verified, and if there is a branch misprediction something special
happens to get rid of the mis-speculated instructions.
We do this all in the context of dynamic scheduling, i.e., out-of-order
execution. However, now, there is an extra phase in the algorithm,
the commit phase. Instructions are committed when we are sure
they were supposed to execute, i.e., we know there is no misprediction.
Results are held in the reorder buffer (ROB) until they are ready
to be committed. The ROB can be thought of as a queue where instructions
are dequeued in order, i.e., instructions are fetched in-order, processed
out-of-order and placed into the ROB, and then "graduate" in-order again
from the ROB. The entries in the reorder buffer now serve as the physical
registers that architectural registers are renamed into (along with
results at reservation stations).
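The queue behavior of the ROB can be sketched in a few lines. This is a minimal model with invented names, not a real design: results arrive out of order, but entries graduate strictly in order from the head:

```python
from collections import deque

# Minimal reorder-buffer model (invented names, heavily simplified).
# Instructions are allocated entries in fetch order; results arrive out
# of order; entries "graduate" only from the head, in order.
class ROB:
    def __init__(self):
        self.entries = deque()          # each entry: [tag, result or None]

    def allocate(self, tag):            # done at issue, in program order
        self.entries.append([tag, None])

    def write_result(self, tag, value): # out-of-order completion
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = value
                return

    def commit(self):                   # in-order graduation
        if self.entries and self.entries[0][1] is not None:
            return self.entries.popleft()
        return None                     # head not ready: nothing commits

rob = ROB()
for tag in ["i1", "i2", "i3"]:
    rob.allocate(tag)
rob.write_result("i3", 30)     # i3 finishes first...
assert rob.commit() is None    # ...but cannot commit past i1
rob.write_result("i1", 10)
assert rob.commit() == ["i1", 10]
```

The key property is visible in the example: a completed instruction sits in the buffer until everything older than it has committed.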
What follows is the clean, idealized version of speculative out-of-order
execution. In practice, implementations work hard to detect mispredictions
as early as possible, and squash only those instructions that were issued
after the mispredicted branch. Still, the cost of a mispredicted branch
can be very high: a minimum of 31 cycles on the Pentium 4 and 14 cycles
on Intel Core, so a good branch predictor is essential. Here are the
four phases:
- Issue.
  - Dequeue an instruction from the issue queue.
  - Find a reservation station for it and reserve a slot for it in the ROB.
  - Send operands to the reservation station from the ROB.
  - Send the number of the reserved ROB entry to the reservation station
    so it can place its result there.
  - If no reservation station or ROB entry is available, stall.
- Execute.
  - Wait for data dependences to be satisfied at the reservation station.
  - When all operands at a reservation station are available,
    execute the instruction.
- Write result.
  - Write the result of executing an instruction on the CDB and
    into the ROB.
  - Instructions waiting on this result pick it up from the CDB at their
    reservation stations.
  - Store instructions write into the ROB, not memory.
- Commit. Instructions are processed from the ROB in order. There are
  three cases:
  - If the instruction is a branch with an incorrect prediction,
    flush the ROB and resume fetch at the correct successor of the branch.
  - If the instruction is a store, then the value is stored to
    memory. Since it has reached the head of the ROB in order,
    it can't have been mispredicted.
  - Otherwise, the instruction updates the register file.
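The three commit cases above can be sketched directly. This toy model uses invented field names and an ordinary list for the ROB; it only illustrates the case analysis at the head of the buffer:

```python
# Sketch of the three commit cases (invented, simplified structures).
# rob is a list of dicts, oldest instruction first; regfile and memory
# are plain dicts standing in for architectural state.
def commit_one(rob, regfile, memory):
    """Process the instruction at the head of the ROB; return the action."""
    entry = rob.pop(0)
    if entry["kind"] == "branch" and entry["mispredicted"]:
        rob.clear()                              # flush all younger work
        return ("refetch", entry["correct_target"])
    if entry["kind"] == "store":
        memory[entry["addr"]] = entry["value"]   # memory updated at commit
        return ("store", entry["addr"])
    regfile[entry["dest"]] = entry["value"]      # ordinary register update
    return ("writeback", entry["dest"])

rob = [
    {"kind": "alu", "dest": "r1", "value": 42},
    {"kind": "branch", "mispredicted": True, "correct_target": 0x400},
    {"kind": "alu", "dest": "r2", "value": 7},   # younger than the branch
]
regs, mem = {}, {}
assert commit_one(rob, regs, mem) == ("writeback", "r1")
assert commit_one(rob, regs, mem) == ("refetch", 0x400)
assert rob == []                     # mis-speculated work was flushed
```

Note how the store case never needs a misprediction check: by the time a store reaches the head of the ROB, every older branch has already committed correctly.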