Coming Challenges in Microarchitecture and
Architecture
RONNY RONEN, SENIOR MEMBER, IEEE, AVI MENDELSON, MEMBER, IEEE, KONRAD LAI,
SHIH-LIEN LU, MEMBER, IEEE, FRED POLLACK, AND JOHN P. SHEN, FELLOW, IEEE
Invited Paper
In the past several decades, the world of computers and
especially that of microprocessors has witnessed phenomenal
advances. Computers have exhibited ever-increasing performance
and decreasing costs, making them more affordable and, in turn,
accelerating additional software and hardware development
that fueled this process even more. The technology that enabled
this exponential growth is a combination of advancements in
process technology, microarchitecture, architecture, and design
and development tools. While the pace of this progress has been
quite impressive over the last two decades, it has become harder
and harder to keep up this pace. New process technology requires
more expensive megafabs and new performance levels require
larger die, higher power consumption, and enormous design and
validation effort. Furthermore, as CMOS technology continues
to advance, microprocessor design is exposed to a new set of
challenges. In the near future, microarchitecture has to consider
and explicitly manage the limits of semiconductor technology, such
as wire delays, power dissipation, and soft errors. In this paper,
we describe the role of microarchitecture in the computer world,
present the challenges ahead of us, and highlight areas where
microarchitecture can help address these challenges.
Keywords—Design tradeoffs, microarchitecture, microarchitecture
trends, microprocessor, performance improvements, power issues,
technology scaling.
I. INTRODUCTION
Microprocessors have gone through significant changes
during the last three decades; however, the basic computational
model has not changed much. A program consists of
instructions and data. The instructions are encoded in a specific
instruction set architecture (ISA). The computational
model is still a single instruction stream, sequential execution
model, operating on the architecture states (memory and
registers). It is the job of the microarchitecture, the logic, and
the circuits to carry out this instruction stream in the “best”
way. “Best” depends on intended usage—servers, desktop,
and mobile—usually categorized as market segments. For
example, servers are designed to achieve the highest performance
possible while mobile systems are optimized for best
performance for a given power. Each market segment has different
features and constraints.
A. Fundamental Attributes
The key metrics for characterizing a microprocessor include:
performance, power, cost (die area), and complexity.
Performance is measured in terms of the time it takes
to complete a given task. Performance depends on many
parameters such as the microprocessor itself, the specific
workload, system configuration, compiler optimizations,
operating systems, and more. A concise characterization of
microprocessor performance was formulated by a number
of researchers in the 1980s; it has come to be known as the
“iron law” of central processing unit performance and is
shown below
Performance = 1/Execution Time = (IPC × Frequency)/Instruction Count

where IPC is the average number of instructions completed
per cycle, Frequency is the number of clock cycles per
second, and Instruction Count is the total number of
instructions executed. Performance can be improved by
increasing IPC and/or frequency or by decreasing instruction
count. In practice, IPC varies depending on the environment—
the application, the system configuration, and more.
Instruction count depends on the ISA and the compiler
used. For a given executable program, where the instruction
stream is invariant, the relative performance depends only on
IPC × Frequency. Performance here is measured in million
instructions per second (MIPS).
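As a concrete illustration of the iron law, the short sketch below evaluates it for two hypothetical design points; the IPC, frequency, and instruction-count figures are invented purely for illustration.

```python
# Illustrative sketch of the "iron law": performance = (IPC * frequency) / instruction_count.
# All design points below are hypothetical, chosen only to show the tradeoff.

def mips(ipc: float, frequency_hz: float) -> float:
    """Relative performance for a fixed executable, in million instructions per second."""
    return ipc * frequency_hz / 1e6

def execution_time(instruction_count: int, ipc: float, frequency_hz: float) -> float:
    """Seconds to complete a program of a given dynamic instruction count."""
    return instruction_count / (ipc * frequency_hz)

# A high-frequency "speed demon" vs. a wider "brainiac" design (hypothetical numbers).
speed_demon = mips(ipc=1.0, frequency_hz=1.0e9)   # 1000 MIPS
brainiac    = mips(ipc=2.5, frequency_hz=500e6)   # 1250 MIPS

print(f"speed demon: {speed_demon:.0f} MIPS, brainiac: {brainiac:.0f} MIPS")
print(f"1e9 instructions take {execution_time(10**9, 1.0, 1.0e9):.2f} s on the first design")
```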
Commonly used benchmark suites have been defined to
quantify performance. Different benchmarks, such as SPEC [1]
and SysMark [2], target different market segments. A
benchmark suite consists of several applications. The time
it takes to complete this suite on a certain system reflects the
system performance.
Power is energy consumption per unit time, in watts.
Higher performance requires more power. However, power
is constrained due to the following.
• Power Density and Thermals: The power dissipated by
the chip per unit area is measured in watts/cm². Increases
in power density cause more heat to be generated. In
order to keep transistors within their operating temperature
range, the heat generated has to be dissipated
from the source in a cost-effective manner. Power density
may soon limit performance growth due to thermal
dissipation constraints.
• Power Delivery: Power must be delivered to a very
large scale integration (VLSI) component at a prescribed
voltage and with sufficient amperage for the
component to run. Very precise voltage regulators/transformers
must control current supplies that can vary within
nanoseconds. As the current increases, the cost and
complexity of these voltage regulators/transformers
increase as well.
• Battery Life: Batteries are designed to supply a certain
number of watt-hours. The higher the power, the shorter the
time that a battery can operate.
Until recently, power efficiency was a concern only in battery
powered systems like notebooks and cell phones. Recently,
increased microprocessor complexity and frequency
have caused power consumption to grow to the level where
power has become a first-order issue. Today, each market
segment has its own power requirements and limits, making
power limitation a factor in any new microarchitecture. Maximum
power consumption increases with the microprocessor
operating voltage (Vcc) and frequency as
follows:

Power = C × Vcc² × Frequency

where C is the effective load capacitance of all devices and
wires on the microprocessor. Within some voltage range, frequency
may go up with supply voltage (Frequency ∝ Vcc). This is a good
way to gain performance, but power is also increased (proportional
to Vcc³). Another important power-related metric is energy efficiency.
Energy efficiency is reflected by the performance/power ratio
and is measured in MIPS/watt.
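To make the scaling concrete, the sketch below evaluates the dynamic-power relation under the simplifying assumption that frequency follows supply voltage linearly within the voltage range of interest; the capacitance, voltage, and frequency values are made up for illustration only.

```python
# Dynamic power sketch: P = C * Vcc^2 * f, assuming (within some range) f scales with Vcc.
# C, Vcc, and f below are illustrative values, not figures from the paper.

def dynamic_power(c_eff_farads: float, vcc_volts: float, freq_hz: float) -> float:
    """Switching power in watts for an effective load capacitance, supply voltage, and frequency."""
    return c_eff_farads * vcc_volts**2 * freq_hz

def mips_per_watt(ipc: float, freq_hz: float, power_watts: float) -> float:
    """Energy efficiency: performance (MIPS) per watt."""
    return (ipc * freq_hz / 1e6) / power_watts

base_power = dynamic_power(c_eff_farads=30e-9, vcc_volts=1.5, freq_hz=1.0e9)

# Raising voltage 10% and letting frequency follow it raises power ~33% (proportional to Vcc^3).
boosted_power = dynamic_power(30e-9, 1.5 * 1.1, 1.0e9 * 1.1)
print(f"power grows by {boosted_power / base_power:.2f}x for a 1.1x voltage/frequency boost")
print(f"efficiency: {mips_per_watt(1.0, 1.0e9, base_power):.1f} MIPS/W")
```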
Cost is primarily determined by the physical size of the
manufactured silicon die. Larger area means higher (even
more than linear) manufacturing cost. Bigger die area usually
implies higher power consumption and may potentially
imply lower frequency due to longer wires. Manufacturing
yield also has direct impact on the cost of each microprocessor.
Complexity reflects the effort required to design, validate,
and manufacture a microprocessor. Complexity is affected by
the number of devices on the silicon die and the level of aggressiveness
in the performance, power and die area targets.
Complexity is discussed only implicitly in this paper.
B. Enabling Technologies
The microprocessor revolution owes its phenomenal
growth to a combination of enabling technologies: process
technology, circuit and logic techniques, microarchitecture,
architecture (ISA), and compilers.
Process technology is the fuel that has moved the entire
VLSI industry and the key to its growth. A new process generation
is released every two to three years. A process generation
is usually identified by the length of a metal-oxide-semiconductor
(MOS) transistor gate, measured in micrometers (10⁻⁶ m, denoted
µm). The most advanced process technology today
(year 2000) is 0.18 µm [3].
Every new process generation brings significant improvements
in all relevant vectors. Ideally, process technology
scales by a factor of 0.7 all physical dimensions of devices
(transistors) and wires (interconnects) including those vertical
to the surface and all voltages pertaining to the devices
[4]. With such scaling, typical improvement figures are the
following:
• 1.4-1.5 times faster transistors;
• two times smaller transistors;
• 1.35 times lower operating voltage;
• three times lower switching power.
Theoretically, with the above figures, one would expect potential
improvements such as the following.
• Ideal Shrink: Use the same number of transistors to
gain 1.5 times performance, two times smaller die, and
two times less power.
• Ideal New Generation: Use two times the number of
transistors to gain three times performance with no increase
in die size and power.
In both ideal scenarios, there is a threefold gain in MIPS/watt
and no change in power density (watts/cm²).
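The two ideal scenarios follow directly from the per-generation factors listed above; the sketch below simply chains those factors and is not a model of any particular process.

```python
# Chain the idealized per-generation scaling factors quoted above (0.7x linear shrink).
# The three input factors come from the text; the rest is simple arithmetic.

transistor_speedup = 1.5    # faster transistors -> higher frequency
area_scale         = 0.5    # each transistor is 2x smaller
switch_energy      = 1 / 3  # 3x lower switching power per transistor

def ideal_shrink():
    """Same transistor count ported to the new process."""
    perf  = transistor_speedup                      # 1.5x performance
    power = switch_energy * transistor_speedup      # 3x less energy, 1.5x more switches -> 0.5x
    return perf, area_scale, power                  # 1.5x perf, 0.5x die, 0.5x power

def ideal_new_generation():
    """Twice the transistors in the same die area (assuming the extra logic is fully used)."""
    perf  = 2 * transistor_speedup                  # 3x performance
    power = 2 * switch_energy * transistor_speedup  # ~1x: no power increase
    return perf, 1.0, power

for name, (perf, area, power) in [("shrink", ideal_shrink()), ("new gen", ideal_new_generation())]:
    print(f"{name}: {perf:.1f}x perf, {area:.1f}x die, {power:.2f}x power, {perf/power:.1f}x MIPS/W")
```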
In practice, it takes more than just process technology
to achieve such performance improvements and usually
at much higher costs. However, process technology is the
single most important technology that drives the microprocessor
industry. Growing 1000 times in frequency (from
1 MHz to 1 GHz) and integration (from 10k to 10M
devices) in 25 years would not have been possible without process
technology improvements.
Innovative circuit implementations can provide better performance
or lower power. New logic families provide new
methods to realize logic functions more effectively.
Microarchitecture attempts to increase both IPC and
frequency. A simple frequency boost applied to an existing
microarchitecture can potentially reduce IPC and thus
does not achieve the expected performance increase. For
Fig. 1. Impact of different pipeline stalls on the execution flow.
example, memory access latency does not scale with microprocessor
frequency. Microarchitecture techniques such
as caches, branch prediction, and out-of-order execution can
increase IPC. Other microarchitecture ideas, most notably
pipelining, help to increase frequency beyond the increase
provided by process technology.
Modern architecture (ISA) and good optimizing compilers
can reduce the number of dynamic instructions executed
for a given program. Furthermore, given knowledge of
the underlying microarchitecture, compilers can produce optimized
code that leads to higher IPC.
This paper deals with the challenges facing architecture
and microarchitecture aspects of microprocessor design. A
brief tutorial/background on traditional microarchitecture is
given in Section II, focusing on frequency and IPC tradeoffs.
Section III describes the past and current trends in microarchitecture
and explains the limits of the current approaches
and the new challenges. Section IV suggests potential microarchitectural
solutions to these challenges.
II. MICROARCHITECTURE AT A GLANCE
Microprocessor performance depends on its frequency and
IPC. Higher frequency is achieved with process, circuit, and
microarchitectural improvements. New process technology
reduces gate delay time, thus cycle time, by 1.5 times. Microarchitecture
affects frequency by reducing the amount of
work done in each clock cycle, thus allowing shortening of
the clock cycle.
Microarchitects tend to divide the microprocessor’s functionality
into three major components [5].
• Instruction Supply: Fetching instructions, decoding
them, and preparing them for execution;
• Execution: Choosing instructions for execution, performing
actual computation, and writing results;
• Data Supply: Fetching data from the memory hierarchy
into the execution core.
A rudimentary microprocessor would process a complete
instruction before starting a new one. Modern microprocessors
use pipelining. Pipelining breaks the processing of
an instruction into a sequence of operations, called stages.
For example, in Fig. 1, a basic four-stage pipeline breaks
the instruction processing into fetch, decode, execute, and
write-back stages. A new instruction enters a stage as soon
as the previous one completes that stage. A pipelined microprocessor
with N pipeline stages can overlap the processing
of N instructions in the pipeline and, ideally, can deliver N
times the performance of a nonpipelined one.
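A rough way to see both the ideal N-times speedup and the way per-stage clocking overhead erodes it is the little model below; the total per-instruction work and the overhead figure are hypothetical numbers chosen for illustration.

```python
# Toy pipeline model: ideal speedup is N, but fixed per-stage clocking overhead
# (setup/hold time, clock skew) erodes it as stages get shorter. Numbers are illustrative.

def pipelined_throughput(total_work_ns: float, n_stages: int, overhead_ns: float) -> float:
    """Instructions per ns for a perfectly balanced pipeline with per-stage overhead."""
    cycle_time = total_work_ns / n_stages + overhead_ns
    return 1.0 / cycle_time

base = pipelined_throughput(total_work_ns=10.0, n_stages=1, overhead_ns=0.1)
for n in (5, 10, 20, 40):
    speedup = pipelined_throughput(10.0, n, 0.1) / base
    print(f"{n:2d} stages -> {speedup:.1f}x over non-pipelined (ideal would be {n}x)")
```

Stalls from dependencies, discussed next, erode the speedup further.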
Pipelining is a very effective technique. There is a clear
trend of increasing the number of pipe stages and reducing
the amount of work per stage. Some microprocessors (e.g.,
Pentium Pro microprocessor [6]) have more than ten pipeline
stages. Employing many pipe stages is sometimes termed
deep pipelining or super pipelining.
Unfortunately, the number of pipeline stages cannot increase
indefinitely.
• There is a certain clocking overhead associated with
each pipe stage (setup and hold time, clock skew). As
cycle time becomes shorter, further increase in pipeline
length can actually decrease performance [7].
• Dependencies among instructions can require stalling
certain pipe stages and result in wasted cycles, causing
performance to scale less than linearly with the number
of pipe stages.
For a given partition of pipeline stages, the frequency of the
microprocessor is dictated by the latency of the slowest pipe
stage. More expensive logic and circuit optimizations help
to accelerate the speed of the logic within the slower pipe
stage, thus reducing the cycle time and increasing frequency
without increasing the number of pipe stages.
It is not always possible to achieve linear performance increase
with deeper pipelines. First, scaling frequency linearly
with the number of stages requires good balancing of the
overall work among the stages, which is difficult to achieve.
Second, with deeper pipes, the number of wasted cycles,
termed pipe stalls, grows. The main reasons for stalls are resource
contention, data dependencies, memory delays, and
control dependencies.
• Resource contention causes pipeline stall when an instruction
needs a resource (e.g., execution unit) that is
currently being used by another instruction in the same
cycle.
• Data dependency occurs when the result of one instruction
is needed as a source operand by another instruction.
The dependent instruction has to wait (stall)
until all its sources are available.
Table 1
Out-Of-Order Execution Example
• Memory delays are caused by memory related data
dependencies, sometimes termed load-to-use delays.
Accessing memory can take from a few cycles to
hundreds of cycles, possibly requiring stalling the pipe
until the data arrives.
• Control dependency stalls occur when the control
flow of the program changes. A branch instruction
changes the address from which the next instruction
is fetched. The pipe may stall and instructions are not
fetched until the new fetch address is known.
Fig. 1 shows the impact of different pipeline stalls on the
execution flow within the pipeline.
In a 1-GHz microprocessor, accessing main memory can
take about 100 cycles. Such accesses may stall a pipelined
microprocessor for many cycles and seriously impact the
overall performance. To reduce memory stalls at a reasonable
cost, modern microprocessors take advantage of the locality
of references in the program and use a hierarchy of memory
components. A small, fast, and expensive (in $/bit) memory
called a cache is located on-die and holds frequently used
data. A somewhat bigger, but slower and cheaper cache may
be located between the microprocessor and the system bus,
which connects the microprocessor to the main memory. The
main memory is yet slower, but bigger and less expensive.
Initially, caches were small and off-die; but over time,
they became bigger and were integrated on chip with the
microprocessor. Most advanced microprocessors today employ
two levels of caches on chip. The first level is 32-128
kB—it typically takes two to three cycles to access and typically
catches about 95% of all accesses. The second level is
256 kB to over 1 MB—it typically takes six to ten cycles to
access and catches over 50% of the misses of the first level.
As mentioned, off-chip memory accesses may take about
100 cycles.
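Using the approximate hit rates and latencies just quoted, a textbook average-memory-access-time estimate makes the cost of the rare main-memory access visible; the arithmetic below is a simple weighted average, not a simulation.

```python
# Average memory access time (AMAT) with the approximate figures quoted above:
# L1 catches ~95% of accesses at ~3 cycles, L2 catches ~50% of L1 misses at ~8 cycles,
# and main memory costs ~100 cycles. Latencies are treated as flat, not cumulative.

l1_hit, l1_lat = 0.95, 3
l2_hit, l2_lat = 0.50, 8          # fraction of L1 misses caught by L2
mem_lat        = 100

amat = (l1_hit * l1_lat
        + (1 - l1_hit) * l2_hit * l2_lat
        + (1 - l1_hit) * (1 - l2_hit) * mem_lat)

print(f"average access time ~= {amat:.1f} cycles")   # ~5.6 cycles
# ~2.5 of those cycles come from the 2.5% of accesses that go all the way to memory.
```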
Note that a cache miss that eventually has to go to the
main memory can take about the same amount of time as
executing 100 arithmetic and logic unit (ALU) instructions,
so the structure of memory hierarchy has a major impact on
performance. Much work has been done in improving cache
performance. Caches are made bigger and heuristics are used
to make sure the cache contains those portions of memory
that are most likely to be used [8], [9].
Change in the control flow can cause a stall. The length
of the stall is proportional to the length of the pipe. In
a super-pipelined machine, this stall can be quite long.
Modern microprocessors partially eliminate these stalls by
employing a technique called branch prediction. When a
branch is fetched, the microprocessor predicts its direction
(taken/not taken) and the target address to which the branch
will go, and starts speculatively executing from the predicted
address. Branch prediction uses both static and runtime
information to make its predictions. Branch predictors today
are very sophisticated. They use an assortment of per-branch
(local) and all-branches (global) history information and can
correctly predict over 95% of all conditional branches [10],
[11]. The prediction is verified when the predicted branch
reaches the execution stage; if it is found to be wrong, the pipe is
flushed and instructions are fetched from the correct target,
resulting in some performance loss. Note that when a wrong
prediction is made, useless work is done on processing
instructions from the wrong path.
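To give a flavor of how runtime information is used, here is a minimal sketch of a classic two-bit saturating-counter predictor indexed by branch address. This is a generic textbook scheme under assumed parameters, not the predictor of any particular microprocessor; the modern predictors cited above combine local and global history and are far more elaborate.

```python
# Minimal two-bit saturating-counter branch predictor, indexed by branch address.
# Table size, initial state, and the example branch are illustrative assumptions.

class TwoBitPredictor:
    def __init__(self, entries: int = 256):
        self.entries = entries
        self.counters = [1] * entries      # counters in 0..3; >= 2 predicts taken

    def _index(self, pc: int) -> int:
        return pc % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Toy usage: a loop branch taken 9 times out of 10 is predicted well after a short warm-up.
bp, correct = TwoBitPredictor(), 0
outcomes = ([True] * 9 + [False]) * 10
for taken in outcomes:
    correct += bp.predict(pc=0x400) == taken
    bp.update(pc=0x400, taken=taken)
print(f"{correct}/{len(outcomes)} correct")
```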
The next step in performance enhancement beyond
pipelining calls for executing several instructions in parallel.
Instead of “scalar” execution, where in each cycle only one
instruction can be resident in each pipe stage, superscalar
execution is used, where two or more instructions can
be at the same pipe stage in the same cycle. Superscalar
designs require significant replication of resources in order
to support the fetching, decoding, execution, and writing
back of multiple instructions in every cycle. Theoretically,
an n-way superscalar pipelined microprocessor can
improve performance by a factor of n over a standard
scalar pipelined microprocessor. In practice, the speedup is
much smaller. Interinstruction dependencies and resource
contentions can stall the superscalar pipeline.
The microprocessors described so far execute instructions
in-order. That is, instructions are executed in the program
order. In in-order processing, if an instruction cannot continue,
the entire machine stalls. For example, a cache miss
delays all following instructions even if they do not need the
results of the stalled load instruction. A major breakthrough
in boosting IPC is the introduction of out-of-order execution,
where instruction execution order depends on data flow, not
on the program order. That is, an instruction can execute if its
operands are available, even if previous instructions are still
waiting. Note that instructions are still fetched in order. The
effect of superscalar and out-of-order processing is shown in
an example in Table 1 where two memory words mem1 and
mem3 are copied into two other memory locations mem2 and
mem4.
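Since Table 1 itself is not reproduced here, the sketch below illustrates the same idea on a plausible four-instruction encoding of that copy sequence (two loads and two stores; the latencies and the unlimited-width, no-structural-hazard issue model are assumptions for illustration).

```python
# Sketch of data-flow (out-of-order) issue: an instruction may issue as soon as its
# source operands are ready, regardless of program order. Latencies are invented.

instrs = [
    ("load  r1, mem1", [],     "r1", 3),   # (text, source regs, dest reg, latency)
    ("store r1, mem2", ["r1"], None, 1),
    ("load  r2, mem3", [],     "r2", 3),
    ("store r2, mem4", ["r2"], None, 1),
]

ready_at = {}                # register -> cycle its value becomes available
for text, srcs, dest, lat in instrs:
    issue = max([ready_at.get(s, 0) for s in srcs], default=0)
    if dest:
        ready_at[dest] = issue + lat
    print(f"cycle {issue}: issue {text}")

# Out of order, the second load issues at cycle 0, overlapping the first load's latency;
# a strictly in-order pipe would have stalled it behind the dependent store.
```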
Out-of-order processing hides some stalls. For example,
while waiting for a cache miss, the microprocessor can
execute newer instructions as long as they are independent
of the load instructions. A superscalar out-of-order
microprocessor can achieve higher IPC than a superscalar
in-order microprocessor. Out-of-order execution involves
dependency analysis and instruction scheduling. Therefore,
it takes a longer time (more pipe stages) to process an
Fig. 2. Processor frequencies over years. (Source: V. De, Intel, ISLPED, Aug. 1999.)
instruction in an out-of-order microprocessor. With a deeper
pipe, an out-of-order microprocessor suffers more from
branch mispredictions. Needless to say, an out-of-order
microprocessor, especially a wide-issue one, is much more
complex and power hungry than an in-order microprocessor
[12].
Historically, there were two schools of thought on how to
achieve higher performance. The “Speed Demons” school
focused on increasing frequency. The “Brainiacs” focused
on increasing IPC [13], [14]. Historically, DEC Alpha [15]
was an example of the superiority of “Speed Demons” over
the “Brainiacs.” Over the years, it has become clear that high
performance must be achieved by progressing in both vectors
(see Fig. 4).
To complete the picture, we revisit the issues of performance
and power. A microprocessor consumes a certain amount of
energy per instruction, EPI, in processing an instruction. This
amount increases with the complexity of the microprocessor.
For example, an out-of-order microprocessor consumes
more energy per instruction than an in-order microprocessor.
When speculation is employed, some processed instructions
are later discarded. The ratio of useful to total number
of processed instructions is denoted U. The total IPC including speculated
instructions is therefore IPC/U. Given these observations,
a number of conclusions can be drawn. The energy per
second, hence power, is proportional to the number of processed
instructions per second and the amount of energy consumed
per instruction, that is, (IPC/U) × Frequency × EPI. The
energy efficiency, measured in MIPS/watt, is proportional to
U/EPI. This value deteriorates as speculation increases and
complexity grows.
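A quick numeric sketch of these relations follows; the speculation ratios and per-instruction energies are made-up figures used only to show the direction of the effect.

```python
# Sketch of the relations above:
#   power     ~ (IPC / U) * frequency * EPI
#   MIPS/watt ~ U / EPI
# U is the fraction of processed instructions that are useful (committed);
# EPI is the energy per processed instruction. All values below are illustrative.

def power_w(ipc, u, freq_hz, epi_joules):
    return (ipc / u) * freq_hz * epi_joules

def mips_per_watt(ipc, u, freq_hz, epi_joules):
    return (ipc * freq_hz / 1e6) / power_w(ipc, u, freq_hz, epi_joules)

# An aggressive speculative design wastes 20% of its work (U = 0.8) and spends more
# energy per instruction than a simpler, non-speculative design.
simple     = mips_per_watt(ipc=1.0, u=1.0, freq_hz=1e9, epi_joules=1e-9)
aggressive = mips_per_watt(ipc=1.8, u=0.8, freq_hz=1e9, epi_joules=2e-9)
print(f"simple: {simple:.0f} MIPS/W, aggressive: {aggressive:.0f} MIPS/W")
# Note that the efficiency depends only on U/EPI, not on IPC or frequency.
```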
One main goal of microarchitecture research is to design a
microprocessor that can accomplish a group of tasks (applications)
in the shortest amount of time while using minimum
amount of power and incurring the least amount of cost. The
design process involves evaluating many parameters and balancing
these three targets optimally with given process and
circuit technology.
III. MICROPROCESSORS—CURRENT TRENDS AND
CHALLENGES
In the past 25 years, chip density and the associated computer
industry have grown at an exponential rate. This phenomenon
is known as “Moore’s Law” and characterizes almost
every aspect of this industry, such as transistor density,
die area, microprocessor frequency, and power consumption.
This trend was possible due to the improvements in fabrication
process technology and microprocessor microarchitecture.
This section focuses on the architectural and the microarchitectural
improvements over the years and elaborates
on some of the current challenges the microprocessor industry
is facing.
A. Improving Performance
As stated earlier, performance can be improved by increasing
IPC and/or frequency or by decreasing the instruction
count. Several architecture directions have been taken to
improve performance. Reduced instruction set computer
(RISC) architecture seeks to increase both frequency and IPC
via pipelining and use of cache memories at the expense of
increased instruction count. Complex instruction set computer
(CISC) microprocessors employ a RISC-like internal representation
to achieve higher frequency while maintaining lower
instruction count. Recently, the very long instruction word
(VLIW) [16] concept was revived with Explicitly Parallel
Instruction Computing (EPIC) [17]. EPIC uses the compiler
to schedule instructions statically. Exploiting parallelism statically
can enable simpler control logic and help EPIC to achieve
higher IPC and higher frequency.
1) Improving Frequency via Pipelining: Process technology
and microarchitecture innovations enable doubling
the frequency every process generation. Fig. 2
presents the contribution of both: as the process improves,
the frequency increases and the average amount of work
done in pipeline stages decreases. For example, the number
of gate delays per pipe stage was reduced by about three
Fig. 3. Frequency and performance improvements—synthetic model. (Source: E. Grochowski,
Intel, 1997.)
times over a period of ten years. Reducing the stage length
is achieved by improving design techniques and increasing
the number of stages in the pipe. While in-order microprocessors
used four to five pipe stages, modern out-of-order
microprocessors can use over ten pipe stages. With frequencies
higher than 1 GHz, we can expect over 20 pipeline
stages.
Improvement in frequency does not always improve
performance. Fig. 3 measures the impact of increasing the
number of pipeline stages on performance using a synthetic
model of an in-order superscalar machine. Performance
scales less than frequency (e.g., going from 6 to 12 stages
yields only a 1.75 times speedup, from 6 to 23 yields only 2.2
times). Performance improves less than linearly due to cache
misses and branch mispredictions. There are two interesting
singular points in the graph that deserve special attention.
The first (at pipeline depth of 13 stages) reflects the point
where the cycle time becomes so short that two cycles are
needed to reach the first level cache. The second (at pipeline
depth of 24 stages) reflects the point where the cycle time
becomes extremely short so that two cycles are needed
to complete even a simple ALU operation. Increasing the
latency of basic operations introduces more pipeline stalls
and impacts performance significantly. Please note that these
trends are true for any pipeline design though the specific
data points may vary depending on the architecture and the
process. In order to keep the pace of performance growth,
one of the main challenges is to increase the frequency
without negatively impacting the IPC. The next sections
discuss some IPC related issues.
2) Instruction Supply Challenges: The instruction
supply is responsible for feeding the pipeline with useful
instructions. The rate of instructions entering the pipeline
depends on the fetch bandwidth and the fraction of useful
instructions in that stream. The fetch rate depends on the
effectiveness of the memory subsystem and is discussed
later along with data supply issues. The number of useful
instructions in the instruction stream depends on the ISA and
the handling of branches. Useless instructions result from: 1)
control flow change within a block of fetched instructions,
leaving the rest of the cache block unused; and 2) branch
mispredictions, which bring in instructions from the wrong path that
are later discarded. On average, a branch occurs every four
to five instructions. Hence, appropriate fetch bandwidth and
accurate branch prediction are crucial.
Once instructions are fetched into the machine they are
decoded. RISC architectures, using fixed length instructions,
can easily decode instructions in parallel. Parallel decoding is
a major challenge for CISC architectures, such as IA32, that
use variable length instructions. Some implementations [18]
use speculative decoders to decode from several potential instruction
addresses and later discard the wrong ones; others
[19] store additional information in the instruction cache to
ease decoding. Some IA32 implementations (e.g., the Pentium
II microprocessor) translate the IA32 instructions into
an internal representation (micro-operations), allowing the
internal part of the microprocessor to work on simple instructions
at high frequency, similar to RISC microprocessors.
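To illustrate why variable-length instructions make parallel decoding hard, the toy sketch below uses a hypothetical encoding in which the first byte of each instruction gives its length; it is not IA32, but it shows that the start of instruction i+1 is unknown until the length of instruction i has been resolved.

```python
# Toy illustration of the serial dependence in variable-length decoding.
# The encoding (first byte = total instruction length) is hypothetical.

def decode_serially(code: bytes) -> list[bytes]:
    """Sequential decode: each length must be resolved before the next start is known."""
    out, pc = [], 0
    while pc < len(code):
        length = code[pc]                 # toy rule: first byte gives the length
        out.append(code[pc:pc + length])
        pc += length
    return out

# A fixed-length ISA could slice N instructions at once; here the boundaries form a chain,
# which is why speculative decoders or predecoded length hints are used in practice.
stream = bytes([2, 0xA0, 3, 0xB0, 0xB1, 1, 4, 0xC0, 0xC1, 0xC2])
print([inst.hex() for inst in decode_serially(stream)])
```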
3) Efficient Execution: The front-end stages of the
pipeline prepare the instructions in either an instruction
Fig. 4. Landscape of microprocessor families.
window [20] or reservation stations [21]. The execution core
schedules and executes these instructions. Modern microprocessors
use multiple execution units to increase parallelism.
Performance gain is limited by the amount of parallelism found
in the instruction window. The parallelism in today's machines
is limited by the data dependencies in the program and by
memory delays and resource contention stalls.
Studies show that in theory, high levels of parallelism are
achievable [22]. In practice, however, this parallelism is not
realized, even when the number of execution units is abundant.
More parallelism requires higher fetch bandwidth, a
larger instruction window, and a wider dependency tracker
and instruction scheduler. Enlarging such structures involves
polynomial complexity increase for less than a linear performance
gain (e.g., scheduling complexity grows roughly with the
square of the scheduling window size [23]). VLIW architectures
[16] such as IA64 EPIC [17] avoid some of this complexity
by using the compiler to schedule instructions.
Accurate branch prediction is critical for deep pipelines in
reducing misprediction penalty. Branch predictors have become
larger and more sophisticated. The Pentium microprocessor
[18] uses a 256-entry array of 2-bit predictors (the predictor
and the target arrays consume 15 kB) that achieves an 85%
correct prediction rate. The Pentium III microprocessor [24]
uses a 512-entry two-level local branch predictor (consuming
30 kB) and yields a 90% prediction rate. The Alph