Beruflich Dokumente
Kultur Dokumente
3:35 - 5:00
5:00 - 5:15 5:15 - 5:45
IBM DAISY
DAISY Schematic
VLIW Machine
PowerPC
PowerPC Memory
DAISY System
L3 Cache
DAISY VLIW
6xx Bus Memory Controller Memory PCI Bus Disk Video Network Keyboard
PowerPC
DAISY
Most DAISY work uses PowerPC as the source architec But, the DAISY approach is general:
ICS2000 reported how to use DAISY with S/390 as the source architecture. The 1996 DAISY Research Report discussed PowerPC, S/390 and x86 as source architectures. ture.
Optimization Unit
Page:
Used by rst versions of DAISY.
Problems of Page
Cross page boundary, have indirect branch, or other serial Generate code for all reachable paths on page, even those Unnecessary serializations to limit code explosion.
A B C D E
16 paths A-E. To avoid code bloat, page-based serializes after 2 paths with D Bad performance on 14/16 paths.
A B D C B D
A
Tree
C D
At most one reaching denition. Register allocation simplied. Optimization can specialize for each path
E.g. A register may contain a constant along one path.
Exit point of group uniquely species path through group Good feedback info Code explosion can be a problem.
DAISY Operation
DAISY Operation
Stopping Point
DAISY Operation
Stopping Point No
DAISY Operation
Stopping Point No
Yes
DAISY Operation
Stopping Point No
Yes
Interp 30x
DAISY Operation
Stopping Point No
Yes
No
Interp 30x
DAISY Operation
Stopping Point No
Yes
No
No
Interp 30x
DAISY Operation
Stopping Point No
Yes
No
No
DAISY Operation
Stopping Point No
Yes
No
No
DAISY Operation
Stopping Point No
Yes
No
No
DAISY Operation
Stopping Point No
Yes
No
No
No
Yes
Yes
Extending Groups
PowerPC X X+4 X+8 X+12 X+16 X+20 X+64 op1 op2 op3 beq X+64 op4 op5 op6 DAISY VLIW Group op1 op2 op3 beq op4 op5
b TRANSLATOR b X+24 Translation
If beq taken later in execution, TRANSLATOR can extend existing VLIW group, or start another at X+64
Extending Groups
PowerPC X X+4 X+8 X+12 X+16 X+20 X+64 X+68 op1 op2 op3 beq X+64 op4 op5 op6 op7 DAISY VLIW Group op1 op6 op2 op3 beq op4 op5 op7
DAISY Optimizations
ILP Scheduling with data and control speculation Loop Unrolling Alias Analysis Load-Store Telescoping Copy propagation Combining Unication Limited dead code elimination
Load-Store Telescoping
def STORE ... LOAD ... STORE ... LOAD use r1 r1,(X) r2,(X) r2,(Y) r3,(Y) r3
Load-Store Telescoping
def STORE ... LOAD ... STORE ... LOAD use r1 r1,(X) r2,(X) r2,(Y) r3,(Y) r3 def use STORE ... LOAD ... STORE ... LOAD r1 r1 r1,(X) r2,(X) r2,(Y) r3,(Y)
DAISY registers r0-r31 always contain the values in Pow DAISY registers r36-r63 are used for renaming speculative DAISY register r32 has PowerPC counter value. DAISY register r33 has PowerPC linkreg value. DAISY register r35 contains the constant 0.
On PowerPC memory accesses, using r0 as an address means literal 0. Keeping 0 in r35 simplies renaming of r0 in some cases. results during scheduling, and as scratchpads. erPC registers r0-r31.
Original PowerPC Code 1) add 2) bc 3) sli 4) xor 5) and 6) bc 7) b 8) L1: sub 9) b 10) L2: cntlz 11) b r1,r2,r3 L1 r12,r1,3 r4,r5,r6 r8,r4,r7 L2 OFFPAGE r9,r10,r11 OFFPAGE r11,r4 OFFPAGE
Original PowerPC Code 1) add 2) bc 3) sli 4) xor 5) and 6) bc 7) b 8) L1: sub 9) b 10) L2: cntlz 11) b r1,r2,r3 L1 r12,r1,3 r4,r5,r6 r8,r4,r7 L2 OFFPAGE r9,r10,r11 OFFPAGE r11,r4 OFFPAGE
Original PowerPC Code 1) add 2) bc 3) sli 4) xor 5) and 6) bc 7) b 8) L1: sub 9) b 10) L2: cntlz 11) b r1,r2,r3 L1 r12,r1,3 r4,r5,r6 r8,r4,r7 L2 OFFPAGE r9,r10,r11 OFFPAGE r11,r4 OFFPAGE
Translated VLIW Code VLIW1: add r1,r2,r3 bc L1 b VLIW2 VLIW2: sli r12,r1,3
Original PowerPC Code 1) add 2) bc 3) sli 4) xor 5) and 6) bc 7) b 8) L1: sub 9) b 10) L2: cntlz 11) b r1,r2,r3 L1 r12,r1,3 r4,r5,r6 r8,r4,r7 L2 OFFPAGE r9,r10,r11 OFFPAGE r11,r4 OFFPAGE
Translated VLIW Code VLIW1: add r1,r2,r3 bc L1 xor r63,r5,r6 b VLIW2 VLIW2: sli r12,r1,3 r4=r63
Original PowerPC Code 1) add 2) bc 3) sli 4) xor 5) and 6) bc 7) b 8) L1: sub 9) b 10) L2: cntlz 11) b r1,r2,r3 L1 r12,r1,3 r4,r5,r6 r8,r4,r7 L2 OFFPAGE r9,r10,r11 OFFPAGE r11,r4 OFFPAGE
Translated VLIW Code VLIW1: add r1,r2,r3 bc L1 xor r63,r5,r6 b VLIW2 VLIW2: sli r12,r1,3 r4=r63 and r8,r63,r7
Original PowerPC Code 1) add 2) bc 3) sli 4) xor 5) and 6) bc 7) b 8) L1: sub 9) b 10) L2: cntlz 11) b r1,r2,r3 L1 r12,r1,3 r4,r5,r6 r8,r4,r7 L2 OFFPAGE r9,r10,r11 OFFPAGE r11,r4 OFFPAGE
xor r63,r5,r6
b VLIW2 b OFFPAGE VLIW2: sli r12,r1,3 r4=r63 and r8,r63,r7 bc L2 cntlz r11,r63 b OFFPAGE b OFFPAGE
DAISY Speculation
DAISY aggressively speculates operations: Control Speculation: Operations Above Branches Data Speculation: Loads above possibly aliased stores.
Loads moved above possibly aliased stores have a load Load-verify reloads the value and checks if it is the same Yes Speculation ok execution continues normally. No Speculation bad Trap to VMM and recover.
as the speculatively loaded value. verify instruction inserted at the in-order location of the original load.
Speculative values in non-PowerPC regs can be thrown Execution can resume with using the in-order value of the If a particular load-verify fails frequently, the VMM can Result: Simple alias analysis by translator sufces.
load. retranslate the block of code, so as not to speculate this load. away.
Load-Verify Patent
On Miss:
Check WIMG Bits associated with each PowerPC Page. (Translation off uses WIMG=0011). I = Caching Inhibited Page. G = Guarded Storage. I=1 and G=1 I/O Access. Speculative I/O Access Quash.
Branch directly:
Fast. VMM must track invalidations of translated groups and x other groups that branch to them.
Cross group branches are looked up in a special LVIA cache. LVIA = Load VLIW Instruction Address.
lvia rX,(rY) b rX rY= PPC Code Addr LVIA Cache VLIW Addr =rX
The LVIA cache is backed up by a larger memory list of translations akin to page tables. If no translation exists, the LVIA op returns the address of the translator.
If there are multiple operations in a VLIW instruction and The entire VLIW instruction is treated as unexecuted. The DAISY VMM then interprets the PowerPC operations It then invokes a translation of the PowerPC exception handler. one causes a synchronous exception,
corresponding to the non-speculative operations in the VLIW instruction to nd the exact point of the exception.
VMM knows where to nd PowerPC operations corresponding to a VLIW instruction as follows: VPA Register with PowerPC Effective Address of Page (20 bits). When translating from PowerPC to VLIW, VMM updates side table with 10 bits for offset in PowerPC page corresponding to start of each VLIW instruction. Combining VPA and offset gives the full effective address of the PowerPC code corresponding to this VLIW. VMM translates this effective address to a real address. When translated VLIW code spans a PowerPC page boundary, it must contain an operation to update the VPA register and verify that the real address of the next page remains unchanged.
DAISY VPA
20 bits VPA
Updated when translated VLIW code passes page boundary in PowerPC code.
False
cr15.eq True
mr
r3=r60 cr8.eq
False b V3 IRUPT V3: Restart at 0x0C cmpi mr cmpi b V4 cr0=r4,0 r5=r63 cr15=r62,0 goto TIP_7 (Start interpreting at 0x1C)
True goto TIP_6 (Execute translated code for 0x1234, the return address)
The PowerPC architecture requires that an Instruction Cache DAISY uses ICBI as a signal to invalidate any translations Keeping track of translations on a 32-byte block basis consumes a great deal of memory: 512 Million 32-byte blocks in 4 GBytes of memory. Even with 1 bit per block, 64 MBytes needed. DAISY keeps track on 4KByte page basis. included in the invalidated block. Block Invalidate (ICBI) op occur whenever code is modied.
Many architectures like x86 and S/390 do not have ICBI Such architectures need hardware bits on each chunk of When a translation is made of a chunk, the bit is set. If anything writes to a chunk of memory with a bit set, an
memory (invisible to the source architecture). like instructions.
exception occurs, alerting the VMM to invalidate translations from that chunk.
Dynamic Translation of realtime code is problematic. Known realtime code such as exception handlers can be In general, such code cannot be known ahead of time. The rst time realtime code is encountered, it executes slowly Result: Variability and delay that may be unacceptable for
realtime. with the overhead of interpretation or translation. pretranslated.
In booting AIX under DAISY dynamic translation simula However, the kerberos system requires the system time on
a machine to match that of the server. (kerberos system authentication expires after a certain time, and with unsynchronized systems, this becomes problematic.) tor, there were no realtime problems with disks, network controllers, keyboard or mouse input, or any other hardware.
Management of Translations
Throw away all translations when cache full. LRU or Generational Garbage Collection.
DAISY. Must throw out translations when invalidated by program. Must throw out translations when translation cache is full. Must throw out translations when tables indexing translation are cache full. Dynamo.
DAISY Performance
|
1: Page 2: Tree / 8 Issue / Hware Commits
10
| | | | | | | | | | | | tpcc
5: Tree / oo Issue
vortex
perl
li
jpeg
compress
88k
16 Issue 16 ALUs 8 Load/Store Units 3 Branches per Cycle 64 Integer Registers Clustered
4 Clusters of 4 ALUs 2 Load/Store Units per Cluster Immediate Bypass within Cluster Extra Cycle Outside Cluster
Infinite Resource
COMPRESS
TPC-C
M88K
PERL
GO
GCC
COMPRESS
ICache CPI
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 LI IJPEG -0.1 VORTEX TPC-C M88K PERL GO GCC ICache Finite Resource Infinite Resource
COMPRESS
DCache CPI
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 LI IJPEG -0.1 VORTEX TPC-C M88K PERL GO GCC DCache ICache Finite Resource Infinite Resource
COMPRESS
DTLB CPI
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 LI IJPEG -0.1 VORTEX TPC-C M88K PERL GO GCC DTLB DCache ICache Finite Resource Infinite Resource
COMPRESS
COMPRESS
Instruction Reuse
80 70 60 Millions 50 40 30 20 10 0 LI M88K IJPEG VORTEX PERL COMPRESS GO GCC TPC-C
VORTEX
COMPRESS
TPC-C
M88K
PERL
GO
GCC
Counts
0.0e+09 | 0
Firmware Ends
| | |
Boot Ends
| 1200
| 1500
| 1800
| 2100
| 2400
Counts
11000000 10000000 9000000 8000000 7000000 6000000 5000000 4000000 3000000 2000000 1000000
0| 0
| | | | | ICBIs 1000 * Non-NOP ICBIs | | | | | | | 300 | 600 | 900 | 1200 | 1500 | 1800 | 2100 | 2400 |
Firmware Ends
Boot Ends
Occurences 1E+0
# of ISI Exceptions -- Page Faults -- Memory Protection # of DSI Exceptions -- Page Faults -- Memory Protection # of Alignment Exce # of Program Excepti -- Traps -- Illegal -- Privileged -- IEEE Float # of System Calls
1E+1
1E+2
1E+3
1E+4
1E+5
1E+6
Transmeta Crusoe
Full x86 compatibility, including all system, I/O and other Performance near x86 implemenations from other vendors. Low Power.
code.
Crusoe Schematic
x86 Applications
Windows, Linux, ... BIOS
x86
Similarities to DAISY
Full System Translation. VLIW designed specially for emulation. Interpretation and Translation Used.
Crusoe Processor
Up to 4 atoms (ops) per molecule 128-bit Molecule
FADD
ADD
LD
BRCC
Integer ALU #0
Branch Unit
In order pipeline 64 Registers Working Shadow For x86 Use For Crusoe Use
TM5400
Shadow GPRs Debug Reg Alias Tbl 64 GPRs TLB 32 FGRs Shadow FGRs LongRun Power Management
ALU0
ALU1
Load/Store
Branch
FPU/Media PCI Control and and Interface ("North Bridge") 32 PCI Bus
64
ICache Ctl
64
DDR SDRAM
TM5400 has 8 KBytes of onchip local memory for data and These presumably allow the most critical parts of CMS to The high 16-way associativity of the L1 DCache acts to
minimize conicts between data used by the x86 code and that used by CMS. remain resident for quick access and without perturbing x86 code and data. 8 KBytes of onchip local memory for instructions.
Memory controller integrated because part of the PC archi In DAISY, we found the PowerPC rmware powered off Since the DAISY VMM was in these DRAM banks, this By having the memory controller under Crusoe control,
power-off killed DAISY. Transmeta should be able to avoid problems like this DRAM power-off. DRAM banks for testing during system startup. tecture.
Same condition code format as traditional x86 processors. Same 80-bit oating point format as traditional x86 proces 7 stage integer pipeline.
Pentium IV has a 20 stage pipeline. sors.
Crusoe Startup
512 KByte FlashROM CMS
Crusoe Startup
512 KByte FlashROM CMS Decompress RAM 2 MBytes of Crusoe Code
Crusoe Startup
512 KByte FlashROM CMS Decompress RAM 2 MBytes of Crusoe Code
Crusoe Startup
512 KByte FlashROM CMS Decompress RAM 2 MBytes of Crusoe Code 8 KB Crusoe Local Ins Mem 8 KB Crusoe Local Data Mem
On exception, purge and reset store buffer of entries made by current code fragment.
Note: Size of Gated Store Buffer limits number of stores in translated group.
Commit and rollback allows a wider variety of optimiza Removal of loop invariants. Strength reduction. Better dead code elimination.
tions than DAISY which can rollback only to the start of an individual VLIW instruction.
(Hardware)
AB
(Store Under Alias Mask)
Addr
Size
stam (Y)
(Hardware)
No Continue Execution
Aliasing
Speculative load from untaken path aliased with store on taken path. A false aliasing problem in straight line code?
Original Code
Translated Code
st (W)
... (X) ldp (Z) ... (W) stam ... (Y) stam
ldp
X=Y
Some concepts of Transmetas speculative load scheme appear similar to those in other approaches: U.S. Patent 5542067: Method and apparatus for improving performance of out of sequence load operations in a computer system. Inventors: Kemal Ebcio lu, Manoj Kumar, Eric Krong stadt. U.S. Patent 5625835: Method and apparatus for reordering memory operations in a superscalar or very long instruction word processor. Inventors: Kemal Ebcio lu, Dave Luick, Jaime Moreno, g Gabriel M. Silberman, P.B. Wintereld.
The rst few times a specic x86 code sequence is exe If [CMS has] been through an area 50 times, we gure we
can optimize that and make it run faster. cuted, Code Morphing interprets the code
Code morphing translates the x86 instructions into highly Executes the code, and Caches the native instruction translations for future use.
optimized and extremely fast VLIW native instructions,
LongRun
power management software slows chip frequency ( ) and voltage ( ) when processing does not require full CPU.
Optimization removes many operations such as condition Simple, in-order hardware. Complexity is in software.
code setting.
Partial register results, e.g. a write to 8-bit ah modies the x86 16 bit code and segment registers.
top 8 bits of ax.
Dynamo
Dynamo Schematic
HPUX Applications
Dynamo Software
HPUX PA-RISC
Extensive Optimization. Interpretation and Translation Used. Source Architecture code unmodied.
Dynamo Goal
to use a piece of software to improve the performance of a native, statically optimized program binary, while it is executing.
Dynamo Characteristics
Written entirely in user-level software. Does not depend on any special programming language, The program binary is
operation. compiler, operating system, or hardware support. left untouched during Dynamos
Dynamo Operation
No Lookup Branch Target in Cache Hit Jump to Top of Fragment in Cache Context Switch Miss Start-of-Trace Condition? Yes Increment Counter Associated with Branch Target Addr No Counter Value Exceeds Hot Threshold? Yes Interpret+Codegen Until Branch
Signal Handler
OS Signal
Fragment Cache
Emit into Cache, Link with Other Fragments & Recycle the Associated Counter
Yes
End-of-Trace Condition?
No
DAISY Operation
Stopping Point No
Yes
No
No
Traces are the unit of Dynamo optimization. A trace is a single path through the code, similar to su A trace may go through:
Indirect branches. Function Calls and Returns. Virtual Function Calls. perblocks.
Two pass optimizer: Forward and Backward. Eliminate unconditional branches since (Traces straightened). Remove Call-Return pairs in a trace. Copy propagation, constant propagation, strength reduction, Fragment linking, i.e. direct branching between translated
groups. improvement compared to always branching via the Dynamo software. loop invariant code motion, loop unrolling.
Pre-emptive Flushing
Complex TCache management schemes can be time con Dynamo looks for sudden jumps in trace creation rate.
Then the entire TCache is ushed. Such ushing allows Dynamo to readapt to new code patterns. It also allows for a quick reset of many data structures and memory pools without costly garbage collection. suming.
Dynamo handles all signals. Signal handlers in the emulated program are translated by
Dynamo like other code. Question: What happens if a program running under Dynamo installs its own signal handler at some random point?
Like DAISY, Dynamo waits for a code fragment to end be For synchronous signals, Dynamo proposes a scheme to However, the current implementation uses an aggressive /
These optimizations include dead code elimination, and removal of loop invariants. use logs, so as to be able to backtrack and report precise PA-RISC state to the emulated signal handler. conservative scheme, where aggressive optimizations are turned off if precise exceptions are needed. fore notifying the emulated program and its signal handler of the exception.
Systems like DAISY and Crusoe that translate to a differ Dynamo can bail out. When Dynamo detects that translated fragments are being
created at too high a rate, it does bail out. can cause for a program. of a program. ent target architecture than source architecture cannot bail out.
Bailing out limits the amount of slowdown that Dynamo Caveat: After bailing out, Dynamo cannot regain control
Dynamo Performance
Dynamo Overhead
1.2
0.8
0.6
0.4
0.2
1038
400
300
200
100
Dynamo Speedup
25 Speedup Relative to -O2 (Percent) Aggressive 20 Conservative No Optimization
15 10
5 0
Opcodes similar to source architecture Condition code setting similar to source architecture Address translation similar to source architecture Floating point format similar to source architecture More registers than source architecture I/O system, mem controller similar to source architecture Memory map consistent with source architecture Able to hide part of real memory from source architecture Timers consistent with source architecture Load-Verify Instruction or other way to speculate loads.
for use by VMM
Can a binary translation machine have generally better per Can all realtime problems be avoided? What memory management schemes are best? Transparently increasing or decreasing amount of memory
for VMM after system booted. formance than a well-designed superscalar?
Summary
There are many approaches to dynamic translation and op The technology is maturing and proving itself. Faster, better optimization is still needed.
timization.
5:00 - 5:15
5:15 - 5:45
Break
Kemal Ebcioglu: LaTTe