
Coming Challenges in Microarchitecture and

Architecture

RONNY RONEN, SENIOR MEMBER, IEEE, AVI MENDELSON, MEMBER, IEEE, KONRAD LAI,

SHIH-LIEN LU, MEMBER, IEEE, FRED POLLACK, AND JOHN P. SHEN, FELLOW, IEEE

Invited Paper

In the past several decades, the world of computers and

especially that of microprocessors has witnessed phenomenal

advances. Computers have exhibited ever-increasing performance

and decreasing costs, making them more affordable and, in turn,

accelerating additional software and hardware development

that fueled this process even more. The technology that enabled

this exponential growth is a combination of advancements in

process technology, microarchitecture, architecture, and design

and development tools. While the pace of this progress has been

quite impressive over the last two decades, it has become harder

and harder to keep up this pace. New process technology requires

more expensive megafabs and new performance levels require

larger die, higher power consumption, and enormous design and

validation effort. Furthermore, as CMOS technology continues

to advance, microprocessor design is exposed to a new set of

challenges. In the near future, microarchitecture has to consider

and explicitly manage the limits of semiconductor technology, such

as wire delays, power dissipation, and soft errors. In this paper,

we describe the role of microarchitecture in the computer world,

present the challenges ahead of us, and highlight areas where

microarchitecture can help address these challenges.

Keywords—Design tradeoffs, microarchitecture, microarchitecture

trends, microprocessor, performance improvements, power issues,

technology scaling.

I. INTRODUCTION

Microprocessors have gone through significant changes

during the last three decades; however, the basic computational

model has not changed much. A program consists of

instructions and data. The instructions are encoded in a specific

instruction set architecture (ISA). The computational

Manuscript received January 1, 2000; revised October 1, 2000.

R. Ronen and A. Mendelson are with the Microprocessor Research Laboratories,

Intel Corporation, Haifa 31015, Israel.

K. Lai and S.-L. Lu are with the Microprocessor Research Laboratories,

Intel Corporation, Hillsboro, OR 97124 USA.

F. Pollack and J. P. Shen are with the Microprocessor Research Laboratories,

Intel Corporation, Santa Clara, CA 95052 USA

Publisher Item Identifier S 0018-9219(01)02069-2.

model is still a single instruction stream, sequential execution

model, operating on the architecture states (memory and

registers). It is the job of the microarchitecture, the logic, and

the circuits to carry out this instruction stream in the “best”

way. “Best” depends on intended usage—servers, desktop,

and mobile—usually categorized as market segments. For

example, servers are designed to achieve the highest performance

possible while mobile systems are optimized for best

performance for a given power. Each market segment has different

features and constraints.

A. Fundamental Attributes

The key metrics for characterizing a microprocessor include:

performance, power, cost (die area), and complexity.

Performance is measured in terms of the time it takes

to complete a given task. Performance depends on many

parameters such as the microprocessor itself, the specific

workload, system configuration, compiler optimizations,

operating systems, and more. A concise characterization of

microprocessor performance was formulated by a number

of researchers in the 1980s; it has come to be known as the

“iron law” of central processing unit performance and is

shown below

Performance = 1/Execution Time = (IPC × Frequency)/Instruction Count

where IPC is the average number of instructions completed per cycle, Frequency is the number of clock cycles per second, and Instruction Count is the total number of instructions executed. Performance can be improved by

increasing IPC and/or frequency or by decreasing instruction

count. In practice, IPC varies depending on the environment—

the application, the system configuration, and more.

Instruction count depends on the ISA and the compiler

used. For a given executable program, where the instruction

0018-9219/01$10.00 © 2001 IEEE

PROCEEDINGS OF THE IEEE, VOL. 89, NO. 3, MARCH 2001 325

stream is invariant, the relative performance depends only on

IPC × Frequency. Performance here is measured in million

instructions per second (MIPS).
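As an illustrative aside (ours, not the paper's), the iron law can be captured in a few lines of Python; the workload figures below are invented purely for illustration:

```python
# Illustrative sketch of the "iron law" of CPU performance.
# All workload numbers are hypothetical, chosen only for illustration.

def execution_time(instruction_count, ipc, frequency_hz):
    """Time to run a program: instructions / (instructions-per-cycle * cycles-per-second)."""
    return instruction_count / (ipc * frequency_hz)

def mips(ipc, frequency_hz):
    """Millions of instructions per second for a fixed instruction stream."""
    return ipc * frequency_hz / 1e6

# A hypothetical 1-GHz processor averaging 1.5 IPC on a 3-billion-instruction program:
t = execution_time(3e9, ipc=1.5, frequency_hz=1e9)
print(t)               # 2.0 seconds
print(mips(1.5, 1e9))  # 1500.0 MIPS
```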

Commonly used benchmark suites have been defined to

quantify performance. Different benchmark suites, such as SPEC [1] and SysMark [2], target different market segments. A

benchmark suite consists of several applications. The time

it takes to complete this suite on a certain system reflects the

system performance.

Power is energy consumption per unit time, in watts.

Higher performance requires more power. However, power

is constrained due to the following.

• Power Density and Thermal: The power dissipated by the chip per unit area is measured in watts/cm². Increases in power density generate more heat. In order to keep transistors within their operating temperature range, the heat generated has to be dissipated from the source in a cost-effective manner. Power density may soon limit performance growth due to thermal dissipation constraints.

• Power Delivery: Power must be delivered to a very

large scale integration (VLSI) component at a prescribed

voltage and with sufficient amperage for the

component to run. A very precise voltage regulator/transformer controls the current supply, which can vary within nanoseconds. As the current increases, the cost and complexity of these voltage regulators/transformers increase as well.

• Battery Life: Batteries are designed to supply a certain number of watt-hours. The higher the power, the shorter the time that a battery can operate.

Until recently, power efficiency was a concern only in battery

powered systems like notebooks and cell phones. Recently,

increased microprocessor complexity and frequency

have caused power consumption to grow to the level where

power has become a first-order issue. Today, each market

segment has its own power requirements and limits, making

power limitation a factor in any new microarchitecture. Maximum

power consumption is increased with the microprocessor

operating voltage (Vcc) and frequency (Frequency) as follows:

Power = C × Vcc² × Frequency

where C is the effective load capacitance of all devices and wires on the microprocessor. Within some voltage range, frequency may go up with supply voltage (Frequency ∝ Vcc). This is a good way to gain performance, but power is also increased (proportional to Vcc³). Another

important power-related metric is energy efficiency. Energy

efficiency is reflected by the performance/power ratio

and measured in MIPS/watt.
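As a sketch of the dynamic-power relation above (the capacitance and voltage figures are hypothetical, chosen for illustration only):

```python
# Sketch of the dynamic-power relation Power = C * Vcc^2 * Frequency.
# Capacitance and voltage values below are hypothetical.

def dynamic_power(c_farads, vcc_volts, freq_hz):
    """Switching power in watts for effective load capacitance C."""
    return c_farads * vcc_volts**2 * freq_hz

base = dynamic_power(50e-9, 1.5, 1e9)   # 50 nF effective load, 1.5 V, 1 GHz
# Within the range where Frequency scales with Vcc, a 10% voltage bump also
# buys ~10% frequency, so power grows roughly as Vcc cubed:
scaled = dynamic_power(50e-9, 1.5 * 1.1, 1e9 * 1.1)
print(base)           # 112.5 W
print(scaled / base)  # ~1.331, i.e., 1.1 cubed
```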

Cost is primarily determined by the physical size of the

manufactured silicon die. Larger area means higher (even

more than linear) manufacturing cost. Bigger die area usually

implies higher power consumption and may potentially

imply lower frequency due to longer wires. Manufacturing yield also has a direct impact on the cost of each microprocessor.

Complexity reflects the effort required to design, validate,

and manufacture a microprocessor. Complexity is affected by

the number of devices on the silicon die and the level of aggressiveness

in the performance, power and die area targets.

Complexity is discussed only implicitly in this paper.

B. Enabling Technologies

The microprocessor revolution owes its phenomenal

growth to a combination of enabling technologies: process

technology, circuit and logic techniques, microarchitecture,

architecture (ISA), and compilers.

Process technology is the fuel that has moved the entire

VLSI industry and the key to its growth. A new process generation

is released every two to three years. A process generation is usually identified by the length of a metal-oxide-semiconductor gate, measured in micrometers (10⁻⁶ m, denoted as µm). The most advanced process technology today (year 2000) is 0.18 µm [3].

Every new process generation brings significant improvements

in all relevant vectors. Ideally, process technology

scales by a factor of 0.7 all physical dimensions of devices

(transistors) and wires (interconnects) including those vertical

to the surface and all voltages pertaining to the devices

[4]. With such scaling, typical improvement figures are the

following:

• 1.4-1.5 times faster transistors;

• two times smaller transistors;

• 1.35 times lower operating voltage;

• three times lower switching power.

Theoretically, with the above figures, one would expect potential

improvements such as the following.

• Ideal Shrink: Use the same number of transistors to

gain 1.5 times performance, two times smaller die, and

two times less power.

• Ideal New Generation: Use two times the number of

transistors to gain three times performance with no increase

in die size and power.

In both ideal scenarios, there is a threefold gain in MIPS/watt and no change in power density (watts/cm²).
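The arithmetic behind these two scenarios can be checked in a few lines (a sketch using the idealized per-generation factors quoted above):

```python
# Checking the ideal-scaling arithmetic from the figures above.
# The factors are the idealized per-generation numbers quoted in the text.

# Ideal shrink: same transistor count on the new process.
shrink_perf, shrink_power, shrink_area = 1.5, 1 / 2.0, 1 / 2.0
# Ideal new generation: 2x transistors, same die size and power.
newgen_perf, newgen_power, newgen_area = 3.0, 1.0, 1.0

for perf, power, area in [(shrink_perf, shrink_power, shrink_area),
                          (newgen_perf, newgen_power, newgen_area)]:
    print("MIPS/watt gain:", perf / power)          # 3.0 in both scenarios
    print("power density change:", power / area)    # 1.0: watts/cm^2 unchanged
```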

In practice, it takes more than just process technology

to achieve such performance improvements and usually

at much higher costs. However, process technology is the

single most important technology that drives the microprocessor

industry. Growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from 10k to 10M devices) in 25 years would not have been possible without process technology improvements.

Innovative circuit implementations can provide better performance

or lower power. New logic families provide new

methods to realize logic functions more effectively.

Microarchitecture attempts to increase both IPC and

frequency. A simple frequency boost applied to an existing

microarchitecture can potentially reduce IPC and thus

does not achieve the expected performance increase.

Fig. 1. Impact of different pipeline stalls on the execution flow.

For example, memory access latency does not scale with microprocessor

frequency. Microarchitecture techniques such

as caches, branch prediction, and out-of-order execution can

increase IPC. Other microarchitecture ideas, most notably

pipelining, help to increase frequency beyond the increase

provided by process technology.

Modern architecture (ISA) and good optimizing compilers

can reduce the number of dynamic instructions executed

for a given program. Furthermore, given knowledge of

the underlying microarchitecture, compilers can produce optimized code that leads to higher IPC.

This paper deals with the challenges facing architecture

and microarchitecture aspects of microprocessor design. A

brief tutorial/background on traditional microarchitecture is

given in Section II, focusing on frequency and IPC tradeoffs.

Section III describes the past and current trends in microarchitecture

and explains the limits of the current approaches

and the new challenges. Section IV suggests potential microarchitectural

solutions to these challenges.

II. MICROARCHITECTURE AT A GLANCE

Microprocessor performance depends on its frequency and

IPC. Higher frequency is achieved with process, circuit, and

microarchitectural improvements. New process technology

reduces gate delay time, thus cycle time, by 1.5 times. Microarchitecture

affects frequency by reducing the amount of

work done in each clock cycle, thus allowing shortening of

the clock cycle.

Microarchitects tend to divide the microprocessor’s functionality

into three major components [5].

• Instruction Supply: Fetching instructions, decoding

them, and preparing them for execution;

• Execution: Choosing instructions for execution, performing

actual computation, and writing results;

• Data Supply: Fetching data from the memory hierarchy

into the execution core.

A rudimentary microprocessor would process a complete

instruction before starting a new one. Modern microprocessors

use pipelining. Pipelining breaks the processing of

an instruction into a sequence of operations, called stages.

For example, in Fig. 1, a basic four-stage pipeline breaks

the instruction processing into fetch, decode, execute, and

write-back stages. A new instruction enters a stage as soon

as the previous one completes that stage. A pipelined microprocessor with N pipeline stages can overlap the processing of N instructions in the pipeline and, ideally, can deliver N times the performance of a nonpipelined one.

Pipelining is a very effective technique. There is a clear

trend of increasing the number of pipe stages and reducing

the amount of work per stage. Some microprocessors (e.g.,

Pentium Pro microprocessor [6]) have more than ten pipeline

stages. Employing many pipe stages is sometimes termed

deep pipelining or super pipelining.

Unfortunately, the number of pipeline stages cannot increase

indefinitely.

• There is a certain clocking overhead associated with

each pipe stage (setup and hold time, clock skew). As

cycle time becomes shorter, further increase in pipeline

length can actually decrease performance [7].

• Dependencies among instructions can require stalling

certain pipe stages and result in wasted cycles, causing

performance to scale less than linearly with the number

of pipe stages.
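The interplay of depth, stalls, and clocking overhead can be sketched with a simple textbook-style model (ours, not the paper's synthetic model; all parameter values are invented):

```python
# A simplified model of pipeline speedup: ideally N stages give N-fold
# speedup, but per-instruction stall cycles and per-stage clocking overhead
# (setup/hold time, clock skew) eat into it. Parameter values are invented.

def pipeline_speedup(stages, stall_cycles_per_instr=0.0, overhead_frac=0.0):
    """Speedup over a nonpipelined design.
    stages: pipeline depth; stall_cycles_per_instr: average wasted cycles
    per instruction; overhead_frac: clocking overhead as a fraction of the
    ideal stage time."""
    cycles_per_instr = 1.0 + stall_cycles_per_instr
    cycle_time = (1.0 / stages) * (1.0 + overhead_frac)  # relative to unpipelined
    return 1.0 / (cycles_per_instr * cycle_time)

print(pipeline_speedup(10))                              # 10.0: ideal case
print(pipeline_speedup(10, stall_cycles_per_instr=0.5))  # ~6.7: stalls hurt
print(pipeline_speedup(20, 1.5, 0.2))   # ~6.7: doubling depth bought nothing
```

Note how the 20-stage configuration, with more stalls and proportionally larger clocking overhead, delivers no more speedup than the 10-stage one, which is the less-than-linear scaling described above.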

For a given partition of pipeline stages, the frequency of the

microprocessor is dictated by the latency of the slowest pipe

stage. More expensive logic and circuit optimizations help

to accelerate the speed of the logic within the slower pipe

stage, thus reducing the cycle time and increasing frequency

without increasing the number of pipe stages.

It is not always possible to achieve linear performance increase

with deeper pipelines. First, scaling frequency linearly

with the number of stages requires good balancing of the

overall work among the stages, which is difficult to achieve.

Second, with deeper pipes, the number of wasted cycles,

termed pipe stalls, grows. The main reasons for stalls are resource

contention, data dependencies, memory delays, and

control dependencies.

• Resource contention causes pipeline stall when an instruction

needs a resource (e.g., execution unit) that is

currently being used by another instruction in the same

cycle.

• Data dependency occurs when the result of one instruction

is needed as a source operand by another instruction.

The dependent instruction has to wait (stall)

until all its sources are available.

RONEN et al.: COMING CHALLENGES IN MICROARCHITECTURE AND ARCHITECTURE 327

Table 1

Out-Of-Order Execution Example

• Memory delays are caused by memory related data

dependencies, sometimes termed load-to-use delays.

Accessing memory can take from a few cycles to hundreds of cycles, possibly requiring stalling the pipe

until the data arrives.

• Control dependency stalls occur when the control

flow of the program changes. A branch instruction

changes the address from which the next instruction

is fetched. The pipe may stall and instructions are not

fetched until the new fetch address is known.

Fig. 1 shows the impact of different pipeline stalls on the

execution flow within the pipeline.

In a 1-GHz microprocessor, accessing main memory can

take about 100 cycles. Such accesses may stall a pipelined

microprocessor for many cycles and seriously impact the

overall performance. To reduce memory stalls at a reasonable

cost, modern microprocessors take advantage of the locality

of references in the program and use a hierarchy of memory

components. A small, fast, and expensive (in $/bit) memory

called a cache is located on-die and holds frequently used

data. A somewhat bigger, but slower and cheaper cache may

be located between the microprocessor and the system bus,

which connects the microprocessor to the main memory. The

main memory is yet slower, but bigger and inexpensive.

Initially, caches were small and off-die; but over time,

they became bigger and were integrated on chip with the

microprocessor. Most advanced microprocessors today employ

two levels of caches on chip. The first level is 32-128

kB—it typically takes two to three cycles to access and typically

catches about 95% of all accesses. The second level is

256 kB to over 1 MB—it typically takes six to ten cycles to

access and catches over 50% of the misses of the first level.

As mentioned, off-chip memory accesses may take about 100 cycles.
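The payoff of this hierarchy can be illustrated with a standard average-memory-access-time calculation (a textbook-style sketch, not from the paper, using the rough hit rates and latencies quoted above):

```python
# Average memory access time (AMAT) for a two-level cache hierarchy,
# using the rough figures quoted in the text.

def amat(l1_hit, l1_lat, l2_hit, l2_lat, mem_lat):
    """l2_hit is the fraction of L1 misses caught by the second level."""
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_hit * l1_lat + l1_miss * (l2_hit * l2_lat + l2_miss * mem_lat)

# ~95% of accesses hit L1 at ~3 cycles, L2 catches ~50% of L1 misses at
# ~8 cycles, main memory costs ~100 cycles:
print(amat(0.95, 3, 0.50, 8, 100))   # ~5.55 cycles on average
```

Even a 5% L1 miss rate nearly doubles the average access time relative to the 3-cycle L1 latency, which is why cache heuristics receive so much attention.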

Note that a cache miss that eventually has to go to the

main memory can take about the same amount of time as

executing 100 arithmetic and logic unit (ALU) instructions,

so the structure of memory hierarchy has a major impact on

performance. Much work has been done in improving cache

performance. Caches are made bigger and heuristics are used

to make sure the cache contains those portions of memory

that are most likely to be used [8], [9].

Change in the control flow can cause a stall. The length

of the stall is proportional to the length of the pipe. In

a super-pipelined machine, this stall can be quite long.

Modern microprocessors partially eliminate these stalls by

employing a technique called branch prediction. When a

branch is fetched, the microprocessor speculates the direction

(taken/not taken) and the target address where a branch

will go and starts speculatively executing from the predicted

address. Branch prediction uses both static and runtime

information to make its predictions. Branch predictors today

are very sophisticated. They use an assortment of per-branch

(local) and all-branches (global) history information and can

correctly predict over 95% of all conditional branches [10],

[11]. The prediction is verified when the predicted branch

reaches the execution stage and if found wrong, the pipe is

flushed and instructions are fetched from the correct target,

resulting in some performance loss. Note that when a wrong

prediction is made, useless work is done on processing

instructions from the wrong path.
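A minimal sketch of a classic per-branch (local) scheme is the 2-bit saturating counter; the table size, indexing, and branch trace below are our own invented illustration, far simpler than the combined local/global predictors the text describes:

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor.
# Table size, indexing, and the branch trace are invented for illustration.

class TwoBitPredictor:
    def __init__(self, entries=256):
        self.counters = [2] * entries   # 0-1 predict not-taken, 2-3 predict taken

    def _index(self, pc):
        return pc % len(self.counters)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken 9 times then not taken once, repeated 3 times:
# the saturating counter mispredicts only the loop exits.
p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3
correct = 0
for taken in outcomes:
    correct += (p.predict(0x400) == taken)
    p.update(0x400, taken)
print(correct, "of", len(outcomes))   # 27 of 30 correct
```

The 2-bit hysteresis is what keeps the loop-exit misprediction from causing a second misprediction on the next iteration, unlike a 1-bit scheme.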

The next step in performance enhancement beyond

pipelining calls for executing several instructions in parallel.

Instead of “scalar” execution, where in each cycle only one

instruction can be resident in each pipe stage, superscalar

execution is used, where two or more instructions can

be at the same pipe stage in the same cycle. Superscalar

designs require significant replication of resources in order

to support the fetching, decoding, execution, and writing

back of multiple instructions in every cycle. Theoretically, an N-way superscalar pipelined microprocessor can improve performance by a factor of N over a standard scalar pipelined microprocessor. In practice, the speedup is

much smaller. Interinstruction dependencies and resource

contentions can stall the superscalar pipeline.

The microprocessors described so far execute instructions

in-order. That is, instructions are executed in the program

order. In in-order processing, if an instruction cannot continue,

the entire machine stalls. For example, a cache miss

delays all following instructions even if they do not need the

results of the stalled load instruction. A major breakthrough

in boosting IPC is the introduction of out-of-order execution,

where instruction execution order depends on data flow, not

on the program order. That is, an instruction can execute if its

operands are available, even if previous instructions are still

waiting. Note that instructions are still fetched in order. The

effect of superscalar and out-of-order processing is shown in

an example in Table 1 where two memory words mem1 and

mem3 are copied into two other memory locations mem2 and

mem4.
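The data-flow idea behind the Table 1 example can be sketched with a toy scheduler (ours, not the paper's; the two-cycle load latency and one-issue-per-cycle machine are invented assumptions):

```python
# Toy schedule comparison for the copy example of Table 1: two independent
# chains (mem1 -> mem2 and mem3 -> mem4). Loads take 2 cycles, stores 1
# (invented latencies); one instruction issues per cycle. An out-of-order
# machine may issue any ready instruction; an in-order one cannot skip past
# a stalled instruction.

LAT = {"load": 2, "store": 1}
# (op, dest, src): each store depends on the preceding load's register.
prog = [("load", "r1", "mem1"), ("store", "mem2", "r1"),
        ("load", "r2", "mem3"), ("store", "mem4", "r2")]

def schedule(prog, in_order):
    done_at = {"mem1": 0, "mem3": 0}        # cycle when each value is ready
    issued = [None] * len(prog)
    cycle = 0
    while None in issued:
        for i, (op, dest, src) in enumerate(prog):
            if issued[i] is None and done_at.get(src, 10**9) <= cycle:
                issued[i] = cycle
                done_at[dest] = cycle + LAT[op]
                break                        # one issue per cycle
            if in_order and issued[i] is None:
                break                        # stalled: cannot skip ahead
        cycle += 1
    return max(done_at.values())

print(schedule(prog, in_order=True))   # 6 cycles: second chain waits
print(schedule(prog, in_order=False))  # 4 cycles: the two chains overlap
```

The out-of-order machine hides the first load's latency by issuing the independent second load in the gap, exactly the stall-hiding described above.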

Out-of-order processing hides some stalls. For example,

while waiting for a cache miss, the microprocessor can

execute newer instructions as long as they are independent

of the load instructions. A superscalar out-of-order

microprocessor can achieve higher IPC than a superscalar

in-order microprocessor. Out-of-order execution involves

dependency analysis and instruction scheduling. Therefore,

it takes a longer time (more pipe stages) to process an instruction in an out-of-order microprocessor.

Fig. 2. Processor frequencies over years. (Source: V. De, Intel, ISLPED, Aug. 1999.)

With a deeper pipe, an out-of-order microprocessor suffers more from

branch mispredictions. Needless to say, an out-of-order

microprocessor, especially a wide-issue one, is much more

complex and power hungry than an in-order microprocessor

[12].

Historically, there were two schools of thought on how to

achieve higher performance. The “Speed Demons” school

focused on increasing frequency. The “Brainiacs” focused

on increasing IPC [13], [14]. Historically, DEC Alpha [15]

was an example of the superiority of “Speed Demons” over

the “Brainiacs.” Over the years, it has become clear that high

performance must be achieved by progressing in both vectors

(see Fig. 4).

To complete the picture, we revisit the issues of performance

and power. A microprocessor consumes a certain

amount of energy, E_i, in processing an instruction. This amount increases with the complexity of the microprocessor. For example, an out-of-order microprocessor consumes more energy per instruction than an in-order microprocessor. When speculation is employed, some processed instructions are later discarded. The ratio of useful to total number of processed instructions is U. The total IPC including speculated instructions is therefore IPC/U. Given these observations, a number of conclusions can be drawn. The energy per second, hence power, is proportional to the number of processed instructions per second times the energy consumed per instruction, that is, (IPC/U) × Frequency × E_i. The energy efficiency, measured in MIPS/watt, is proportional to U/E_i. This value deteriorates as speculation increases (U decreases) and complexity grows (E_i increases).
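A short numeric sketch of this tradeoff follows; the per-instruction energies and speculation ratio are hypothetical values chosen only to make the U/E_i proportionality concrete:

```python
# Sketch of the speculation/complexity tradeoff: power grows with
# (IPC / U) * Frequency * E_i, so energy efficiency is proportional to
# U / E_i. The numbers below are hypothetical.

def mips_per_watt(u, e_per_instr_joules):
    # Useful MIPS / watts = (IPC * f) / ((IPC / u) * f * e) = u / e,
    # scaled to millions of instructions.
    return u / e_per_instr_joules / 1e6

in_order = mips_per_watt(u=1.0, e_per_instr_joules=2e-9)  # no speculation
ooo      = mips_per_watt(u=0.8, e_per_instr_joules=5e-9)  # 20% wasted work,
                                                          # costlier instructions
print(in_order / ooo)   # the simpler core is ~3.1x more energy efficient
```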

One main goal of microarchitecture research is to design a

microprocessor that can accomplish a group of tasks (applications)

in the shortest amount of time while using the minimum amount of power and incurring the least amount of cost. The

design process involves evaluating many parameters and balancing

these three targets optimally with given process and

circuit technology.

III. MICROPROCESSORS—CURRENT TRENDS AND

CHALLENGES

In the past 25 years, chip density and the associated computer industry have grown at an exponential rate. This phenomenon

is known as “Moore’s Law” and characterizes almost

every aspect of this industry, such as transistor density,

die area, microprocessor frequency, and power consumption.

This trend was possible due to the improvements in fabrication

process technology and microprocessor microarchitecture.

This section focuses on the architectural and the microarchitectural

improvements over the years and elaborates

on some of the current challenges the microprocessor industry

is facing.

A. Improving Performance

As stated earlier, performance can be improved by increasing IPC and/or frequency or by decreasing the instruction count. Several architecture directions have been taken to improve performance. Reduced instruction set computer (RISC) architecture seeks to increase both frequency and IPC via pipelining and use of cache memories at the expense of increased instruction count. Complex instruction set computer (CISC) microprocessors employ RISC-like internal representation to achieve higher frequency while maintaining lower instruction count. Recently, the very long instruction word (VLIW) [16] concept was revived with Explicitly Parallel Instruction Computing (EPIC) [17]. EPIC uses the compiler to schedule instructions statically. Exploiting parallelism statically can enable simpler control logic and help EPIC to achieve higher IPC and higher frequency.

1) Improving Frequency via Pipelining: Process technology and microarchitecture innovations enable doubling the frequency every process generation. Fig. 2

presents the contribution of both: as the process improves,

the frequency increases and the average amount of work

done in pipeline stages decreases. For example, the number

of gate delays per pipe stage was reduced by about three times over a period of ten years.

Fig. 3. Frequency and performance improvements—synthetic model. (Source: E. Grochowski, Intel, 1997.)

Reducing the stage length

is achieved by improving design techniques and increasing

the number of stages in the pipe. While in-order microprocessors

used four to five pipe stages, modern out-of-order

microprocessors can use over ten pipe stages. With frequencies

higher than 1 GHz, we can expect over 20 pipeline

stages.

Improvement in frequency does not always improve

performance. Fig. 3 measures the impact of increasing the

number of pipeline stages on performance using a synthetic

model of an in-order superscalar machine. Performance

scales less than frequency (e.g., going from 6 to 12 stages

yields only a 1.75 times speedup, from 6 to 23 yields only 2.2

times). Performance improves less than linearly due to cache

misses and branch mispredictions. There are two interesting

singular points in the graph that deserve special attention.

The first (at pipeline depth of 13 stages) reflects the point

where the cycle time becomes so short that two cycles are

needed to reach the first level cache. The second (at pipeline

depth of 24 stages) reflects the point where the cycle time

becomes extremely short so that two cycles are needed

to complete even a simple ALU operation. Increasing the

latency of basic operations introduces more pipeline stalls

and impacts performance significantly. Please note that these

trends are true for any pipeline design though the specific

data points may vary depending on the architecture and the

process. In order to keep the pace of performance growth,

one of the main challenges is to increase the frequency

without negatively impacting the IPC. The next sections

discuss some IPC related issues.

2) Instruction Supply Challenges: The instruction

supply is responsible for feeding the pipeline with useful

instructions. The rate of instructions entering the pipeline

depends on the fetch bandwidth and the fraction of useful

instructions in that stream. The fetch rate depends on the

effectiveness of the memory subsystem and is discussed

later along with data supply issues. The number of useful

instructions in the instruction stream depends on the ISA and

the handling of branches. Useless instructions result from: 1) control flow changes within a block of fetched instructions, leaving the rest of the cache block unused; and 2) branch mispredictions, which bring instructions from the wrong path and are later discarded. On average, a branch occurs every four

to five instructions. Hence, appropriate fetch bandwidth and

accurate branch prediction are crucial.

Once instructions are fetched into the machine they are

decoded. RISC architectures, using fixed length instructions,

can easily decode instructions in parallel. Parallel decoding is

a major challenge for CISC architectures, such as IA32, that

use variable length instructions. Some implementations [18]

use speculative decoders to decode from several potential instruction

addresses and later discard the wrong ones; others

[19] store additional information in the instruction cache to

ease decoding. Some IA32 implementations (e.g., the Pentium

II microprocessor) translate the IA32 instructions into

an internal representation (micro-operations), allowing the

internal part of the microprocessor to work on simple instructions

at high frequency, similar to RISC microprocessors.

3) Efficient Execution: The front-end stages of the

pipeline prepare the instructions in either an instruction window [20] or reservation stations [21].

Fig. 4. Landscape of microprocessor families.

The execution core

schedules and executes these instructions. Modern microprocessors

use multiple execution units to increase parallelism.

Performance gain is limited by the amount of parallelism found in the instruction window. The parallelism in today's machines is limited by the data dependencies in the program and by memory delays and resource contention stalls.

Studies show that in theory, high levels of parallelism are

achievable [22]. In practice, however, this parallelism is not

realized, even when the number of execution units is abundant.

More parallelism requires higher fetch bandwidth, a

larger instruction window, and a wider dependency tracker

and instruction scheduler. Enlarging such structures involves

polynomial complexity increase for less than a linear performance gain (e.g., scheduling complexity grows roughly quadratically, O(n²), with the scheduling window size n [23]). VLIW architectures

[16] such as IA64 EPIC [17] avoid some of this complexity

by using the compiler to schedule instructions.

Accurate branch prediction is critical for deep pipelines in

reducing misprediction penalty. Branch predictors have become

larger and more sophisticated. The Pentium microprocessor

[18] uses 256 entries of 2-bit predictors (the predictor

and the target arrays consume 15 kB) that achieve 85%

correct prediction rate. The Pentium III microprocessor [24]

uses 512 entries of a two-level local branch predictor (consuming 30 kB) and yields a 90% prediction rate. The Alph

 