#### IA-64 Architecture and Its Performance

Hsin-Ying Lin, hsin-ying\_lin@hp.com
Kevin Wadleigh, kevin\_wadleigh@hp.com

#### So, What is IA-64?

#### Most Significant Architecture Since 80386

- 64-Bit Architecture (Post-RISC, 32-Bit)
- Explicitly Parallel Instruction Computing (EPIC)
- Comprehensive Predication
- Enhanced Speculation

## The IA-64 Advantages

#### Performance Optimized

- Breakthrough Performance for Workstation and Server Applications
- Multi-Platform Support
- Delivering Next-Generation Computing Today

#### High-End Application Support

- E-services
- Technical Computing
- Business Intelligence

#### 7 Today's Architecture Challenges:

- Memory Latency
- Branch Misdirects
- Too Few Registers
- Hardware-Based Instruction Scheduling
- Memory Addressing Efficiency
- Hardware, I/O Capacity Scalability

#### HP's IA-64 Target Customers

- Network Security
- Memory Addressing Efficiency
- Breakthrough OLTP Performance
- Floating-Point Mathematics

Hsin-Ying Lin and Kevin Wadleigh MSW, TCD

#### What are the IA-64 Customer Benefits?
### HP's Binary Compatibility Advantage: Itanium & HP-UX, Windows NT, and Linux

#### IA-64

- Architecture resources
- Predication
- Register rotation
- Speculation
- Processors
- Performance Tuning for ISV

#### Machine Resources

#### Instruction Bundling

- 128-bit aligned instruction bundles contain
  - three 41-bit instructions
  - 5-bit template consisting of 4-bit dispersal template + 1 stop bit
- Branches are to bundle boundaries
- Implementations are allowed to have any number of functional units
- Template controls dispersal to functional units: Memory, Integer, Floating-point, Branch, Long immediate

|        |        |        | template |
|--------|--------|--------|----------|
| slot 2 | slot 1 | slot 0 | disp s   |

#### Templates and Dispersal

#### Templates:

MLX, MII, MI/I, MMI, M/MI, MMF, MMB, MFI, MFB, MIB, MBB, BBB

/ = stop bit

Each template is available with a stop bit at the end.

Dispersal maps the instructions to functional units. This example shows a CPU that can issue to at least two M units, two F units, an I unit and a B unit in one cycle. (Itanium can perform 2M, 2I, 2F, 3B per cycle.)

#### Parallelism: Code example

y = x + y

- Instruction stream
- Maps to

| Slot contents      | Template |
|--------------------|----------|
| M, nop, I ;;       | MMI      |
| nop.m, F, nop.i ;; | MFI      |
| M, nop.i, nop.i    | MII      |

#### Predication

- Removes branches
- Converts a control dependence to a data dependence
- Compare instructions set predicate bits
- Predicated instructions are either executed normally or they do not affect the architectural state
- Example code below

```
if (ix .eq. iy) then
   a = 0
else
   c = 0
end if
```

• Becomes

```
cmp.eq p6, p7 = r16, r17 ;;
(p6) fadd.d f4 = f0, f0
(p7) fadd.d f5 = f0, f0
```

• Maps to

```
nop.m  I  nop.i ;;
nop.m  F  nop.i
nop.m  F  nop.i
```

#### Software Pipelining

- Traditional architectures use loop unrolling to hide latencies
  - High overhead: extra code for loop body, prologue, and epilogue
- Synergistic use of IA-64 features allows efficient pipelining
  - Special branches cause registers to rotate
  - Register rotation removes the need for explicit unrolling
  - Predicate rotation removes prologue & epilogue

#### Register Rotation

- Key to good loop performance
  - software pipelining uses register rotation
  - acts like short vectors
  - with each iteration of a loop, data in rotating registers moves to the next register in the set
- This example omits the predication necessary for prologue and epilogue

## Software Pipelining using Rotation and Predication

• DAXPY inner loop

```
for (i = 0; i < 3; i++)
    dy[i] = dy[i] + (da * dx[i]);
/* 2 loads, 1 fma, 1 store per iteration */
```

- Consider a hypothetical processor that can perform
  - 2 loads, 1 fma, 1 store per iteration
  - load latency of 2 cycles
  - fma latency of 1 cycle
  - (Itanium can perform: 2M, 2I, 2F, 3B per cycle)

#### Example: Pipeline

Each column represents 1 source iteration.

#### Example Code

```
      .rotf dx[3], dy[3], tmp[2]          // short vectors
      mov   ar.lc = 2                     // lc = loop count = #iterations-1
      mov   ar.ec = 4                     // epilogue count = #stages (or # pred)
      mov   pr.rot = 0x10000              // p16=1, p17=p18=...=0
      ;;
looptop:
(p16) ldfd  dx[0] = [dxsp], 8
(p16) ldfd  dy[0] = [dysp], 8
(p18) fma.d tmp[0] = da, dx[2], dy[2]
(p19) stfd  [dydp] = tmp[1], 8
      br.ctop looptop ;;
```

#### **Execution Sequence**

Each pass through the loop issues the same four instructions, `(p16) ld x`, `(p16) ld y`, `(p18) fma`, `(p19) st`; only those whose predicate is 1 take effect. The slides step through the predicate state as follows (LC and EC are shown only where the slides give them):

| Step       | p16 | p17 | p18 | p19 | LC | EC |
|------------|-----|-----|-----|-----|----|----|
| Loop 1     | 1   | 0   | 0   | 0   | 2  | 4  |
| Loop 2     | 1   | 1   | 0   | 0   | 1  | 4  |
| Loop 3     | 1   | 1   | 1   | 0   |    |    |
| Epilogue 1 | 0   | 0   | 1   | 1   |    |    |
| Epilogue 2 | 0   | 0   | 0   | 1   |    |    |
| Epilogue 3 | 0   | 0   | 0   | 0   |    |    |

Done

#### Pipeline and Latency

- Suppose we change the latencies to
  - load latency of 6 cycles
  - fma latency of 4 cycles
- Each column represents 1 source iteration

(Figure: pipeline stages load dx, dy → tmp = dy + da * dx → store dy.)

#### Updated Loop

```
      .rotf dx[7], dy[7], tmp[5]
      mov   ar.lc = 2                     // #iterations-1
      mov   ar.ec = 11                    // #stages
      mov   pr.rot = 0x10000 ;;
looptop:
(p16) ldfd  dx[0] = [dxsp], 8
(p16) ldfd  dy[0] = [dysp], 8
(p22) fma.d tmp[0] = da, dx[6], dy[6]
(p26) stfd  [dydp] = tmp[4], 8
      br.ctop looptop ;;
```

#### Rotation: Summary

- Loop pipelining maximizes performance; minimizes overhead
  - Avoids code expansion of unrolling and code explosion of prologue and epilogue
  - Smaller code means fewer cache misses
  - Greater performance improvements in higher latency conditions
- Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  - Typical of integer scalar codes

#### Speculation

- Memory is very far away, so we would like to load data well before its use
- Prefetch instructions will not prefetch pages that have not been mapped by the TLB
- Prefetch instructions will not prefetch data from invalid addresses
- Speculative loads allow users to try to load data from addresses regardless of whether or not the data will be used, the address will be written to in the meantime, or the address is known to be valid. What could go wrong?
- Control speculation versus data speculation
  - Control: moves loads around branches
  - Data: moves loads around stores

# Control Speculation: Move Loads before Branches

#### Control Speculation

- Regular loads are replaced with a speculative load, followed by a speculative chk instruction
  - ld is replaced by ld.s, chk.s
  - ldf is replaced by ldf.s, chk.s
- For safety, special values are used for illegal returns
  - Integer loads set the Not a Thing (NaT) bit associated with the target general register
  - Floating-point loads set the target floating-point register to a special value: NaTVal = 0,0x1FFFE,0...0

#### NaT ("Not a Thing") and NaTVal

- NaT (or NaTVal) indicates:
  - whether or not an exception has occurred
- If NaT (or NaTVal) is set during ld.s (ldf.s), it is checked by the instruction chk.s (usage: chk.s reg, target), which then branches to target
  - code at target can redo the load and take the normal exception

# Data Speculation: Move Loads before Stores

#### Data Speculation

- Moves loads around possibly conflicting stores
- Regular loads are replaced with advanced loads, followed by either a check load or an advanced chk instruction
- If the only instruction that was ambiguous is the load, then a check load can be performed after the load
  - ld is replaced by ld.a, ld.c.clr
  - ldf is replaced by ldf.a, ldf.c.clr
- If there are several instructions that depend on the advanced load, then a chk.a can be used to branch to fix-up code
  - ld is replaced by ld.a, chk.a
  - ldf is replaced by ldf.a, chk.a

#### Data Speculation: example

• If the only instruction that was ambiguous is the load, then a check load can be performed after the load

st [r4] = r12

• Becomes

...

st [r4] = r12

#### Processor Evolution

# HP Microprocessor Roadmap

July 6, 2000

# Performance Tuning for ISV

### Characteristics of ISV Application

One of our commercial ISV applications involves a lot of floating-point computation. On its benchmark suite, over 50% of the computation time was concentrated in about 25% of the routines. Furthermore, about 40% of the computation time was actually spent in two kernels, WVAXPY and DOTPRODUCT.
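The concentration of run time described above also bounds how much kernel tuning can help the application as a whole. The following is a minimal sketch in C, assuming Amdahl's law applies (the part of the run time outside the tuned kernels is unchanged). The helper names `overall_speedup` and `speedup_limit` are hypothetical, and the kernel speedup passed in is an illustrative parameter rather than a measured value.

```c
#include <assert.h>

/*
 * Amdahl's law: a fraction `tuned_fraction` of the run time is sped up by
 * `kernel_speedup`; the remaining run time is assumed to be unchanged.
 * `overall_speedup` is a hypothetical helper name used for illustration.
 */
double overall_speedup(double tuned_fraction, double kernel_speedup)
{
    assert(tuned_fraction >= 0.0 && tuned_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - tuned_fraction) + tuned_fraction / kernel_speedup);
}

/* Upper bound on the overall speedup as the kernel speedup grows without limit. */
double speedup_limit(double tuned_fraction)
{
    assert(tuned_fraction >= 0.0 && tuned_fraction < 1.0);
    return 1.0 / (1.0 - tuned_fraction);
}
```

With 40% of the time in the two kernels, doubling kernel speed gives `overall_speedup(0.4, 2.0)` = 1.25, and kernel tuning alone cannot exceed `speedup_limit(0.4)`, about 1.67.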
#### WVAXPY: C Code

```
/* w[i] = x[i] + alpha * y[i] for i = 0 .. n-1 */
wvaxpy(w, x, y, n, alpha)
double *w, *x, *y, alpha;
int n;
{
    while (n-- > 0)
        *(w++) = *(x++) + alpha * *(y++);
}
```

# IA-64 Compiler Generated Code for WVAXPY: Ideally

| Instructions                  | Template | Clocks on Merced L1 |
|-------------------------------|----------|---------------------|
| L L x+ S L x+                 | MMF MMF  | 2                   |
| L S - L L x+                  | MMI MMF  | 2                   |
| S L x+ L S -                  | MMF MMI  | 2                   |
| L L x+ S L x+                 | MMF MMF  | 2                   |
| L S - L L x+                  | MMI MMF  | 2                   |
| S L x+ LF S -                 | MMF MMI  | 2                   |
| L LF - LF - B                 | MMI MIB  | 2                   |
| Total clocks for 8 iterations |          | 14                  |
| Clocks per iteration          |          | 1.75                |

Note: LF indicates a prefetch instruction.

# IA-64 Compiler Generated Code for WVAXPY

| Instructions                  | Template | Clocks on Itanium L1 |
|-------------------------------|----------|----------------------|
| LL- LL-                       | MMI MMI  | 2                    |
| LL- LL-                       | MMI MMI  | 2                    |
| LF - x+ x+                    | MMF MMF  | 2                    |
| LF - x+ x+                    | MMF MMF  | 2                    |
| LF LF - S S -                 | MMI MMI  | 2                    |
| LF LF -                       | MMI      | 1                    |
| LF LF - S S B                 | MMI MMB  | 2                    |
| Total clocks for 4 iterations |          | 13                   |
| Clocks per iteration          |          | 3.25                 |

Note: LF indicates a prefetch instruction.

# Assembly Code Generated by Compiler for WVAXPY

```
..L11:
(p16) lfetch.nt1 [r17], 8               // M
(p16) ldfd       f44 = [r11], 64        // M
      nop.m      0                      // M
(p16) ldfd       f47 = [r10], 32        // M
(p17) fma.d.s0   f45 = f8, f35, f39     // F
      nop.i      0                      // I
      nop.m      0                      // M
(p16) ldfd       f34 = [r9], 32         // M
      nop.m      0                      // M
(p17) fma.d.s0   f48 = f8, f33, f37 ;;  // F
(p16) ldfd       f32 = [r8], 32         // M
(p16) lfetch.nt1 [r11], 8               // M
      nop.i      0 ;;                   // I
(p16) lfetch.nt1 [r17], 8               // M
(p16) ldfd       f40 = [r17], 64        // M
      nop.i      0                      // I
(p16) ldfd       f42 = [r16], 32        // M
(p18) stfd       [r19] = f46, 16        // M
(p18) stfd       [r18] = f49, 16        // M
      nop.i      0                      // I
      nop.i      0 ;;                   // I
(p16) ldfd       f38 = [r15], 32        // M
(p16) lfetch.nt1 [r11], 8               // M
(p16) ldfd       f36 = [r14], 32        // M
(p16) lfetch.nt1 [r17], 8               // M
      nop.i      0 ;;                   // I
      nop.i      0 ;;                   // I
(p16) lfetch.nt1 [r11], -56             // M
(p16) lfetch.nt1 [r11], 8               // M
(p16) lfetch.nt1 [r17], -56             // M
      nop.m      0                      // M
      nop.i      0                      // I
(p17) fma.d.s0   f50 = f8, f45, f41     // F
(p17) stfd       [r19] = f50, 16        // M
      nop.m      0                      // M
(p17) stfd       [r18] = f51, 16        // M
      br.ctop.dptk.few ..L11 ;;         // B
      nop.m      0                      // M
(p17) fma.d.s0   f51 = f8, f48, f43 ;;  // F
```

### Hand Tuned IA-64 WVAXPY Assembly Code

| Instructions                       | Template | Clocks on Itanium L1 |
|------------------------------------|----------|----------------------|
| LP S x+ LP S x+                    | MMF MMF  | 2                    |
| LP S x+ LP S x+                    | MMF MMF  | 2                    |
| LP S x+ LP S x+                    | MMF MMF  | 2                    |
| LP S x+ LP S x+                    | MMF MMF  | 2                    |
| LF LF - LF - B                     | MMI MIB  | 2                    |
| Total clocks for 8 iterations      |          | 10                   |
| Clocks per iteration               |          | 1.25                 |
| Speedup ratio of HLL/Assembly      |          | 1.4                  |
| Speedup ratio of Compiler/Assembly |          | 2.6                  |

Note: LF indicates a prefetch instruction; LP means a quad word load.

#### **WVAXPY Assembly vs C Code's Performance** on IA-64 Itanium 499 MHz

C code compiled with +O2 +Onoparmsoverlap +Odataprefetch

### WVAXPY Assembly vs C Codes' Speedup on Itanium 499 MHz CPU

#### DOTPRODUCT: C Code

```
double dotproduct(a, b, n)
double *a, *b;
int n;
{
    double dot;

    dot = 0.;
    while (n-- > 0)
        dot += *(a++) * *(b++);
    return (dot);
}
```

# IA-64 Compiler Generated Code for DOTPRODUCT

| Instructions                  | Template | Clocks on Itanium |
|-------------------------------|----------|-------------------|
| L L x+ LF LF I                | MMF MMI  | 2                 |
| M M B                         | MMB      | 1                 |
| Total clocks for 1 iteration  |          | 4                 |
| Clocks per iteration          |          | 4                 |

Note: The floating-point latency determines the rate.
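The note above says the floating-point latency sets the rate: in the C loop every `dot +=` has to wait for the previous one, because all iterations update the same accumulator. One common way to relax that dependence, sketched here in C, is to keep several independent partial sums and combine them at the end. The function name `dotproduct_4acc` and the choice of four accumulators are assumptions made for illustration; this is not the hand-tuned assembly from this deck, and regrouping the additions can change floating-point rounding slightly.

```c
#include <assert.h>

/*
 * Hypothetical dot product with four independent accumulators.
 * Each partial sum forms its own dependence chain, so a multiply-add
 * on s1 does not have to wait for the one on s0 to complete.
 */
double dotproduct_4acc(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;

    assert(a && b);

    /* Main loop: four elements per trip, one per accumulator. */
    for (; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Remainder when n is not a multiple of four. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    /* Combine the partial sums once, outside the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

The number of accumulators worth using depends on the floating-point latency and issue width of the target processor, so four is only a starting point for experimentation.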
### IA-64 DOTPRODUCT Assembly Code Generated by Compiler

```
.L5:
(p16) ldfd       f32 = [r8], 8          // M [line/col 7/20]
(p16) ldfd       f35 = [r14], 8         // M [line/col 7/20]
(p18) fma.d.s0   f8 = f37, f34, f8      // F
(p16) lfetch.nt1 [r9], 8                // M
(p16) lfetch.nt1 [r10], 8               // M
      nop.i      0 ;;                   // I
      nop.m      0                      // M
      nop.m      0                      // M
      br.ctop.dptk.few .L5 ;;           // B [line/col 7/11]
```

# IA-64 Compiler Generated Code for DOTPRODUCT: Ideally

| Instructions                  | Template | Clocks on Itanium |
|-------------------------------|----------|-------------------|
| L L x+ L L x+                 | MMF MMF  | 2                 |
| L L x+ L L x+                 | MMF MMF  | 2                 |
| L L x+ L L x+                 | MMF MMF  | 2                 |
| L L x+ L L x+                 | MMF MMF  | 2                 |
| LF LF B                       | MMB      | 1                 |
| Total clocks for 8 iterations |          | 9                 |
| Clocks per iteration          |          | 1.13              |

# Hand Tuned IA-64 DOTPRODUCT Assembly Code

| Instructions                       | Template | Clocks on Itanium L1 |
|------------------------------------|----------|----------------------|
| LP x+ I LP x+ I                    | MFI MFI  | 1                    |
| LP x+ I LP x+ I                    | MFI MFI  | 1                    |
| LP x+ I LP x+ I                    | MFI MFI  | 1                    |
| LP x+ I LP x+ I                    | MFI MFI  | 1                    |
| LF LF B                            | MMB      | 1                    |
| Total clocks for 8 iterations      |          | 5                    |
| Clocks per iteration               |          | 0.6                  |
| Speedup ratio of HLL/Assembly      |          | 1.8                  |
| Speedup ratio of Compiler/Assembly |          | 6.4                  |

## **DOTPRODUCT Assembly Codes vs C Codes** on Itanium 499 MHz

cc\_d: compiled with +O2 +Onoparmsoverlap +Odataprefetch

## **DOTPRODUCT's Assembly Code vs C Code** on Itanium 499 MHz CPU

C code compiled with +O2 +Onoparmsoverlap +Odataprefetch

# IA-64 Assembly code vs C code on Itanium

| Speedup    | In-Cache | Out-of-Cache |
|------------|----------|--------------|
| WVAXPY     | 2.8x     | 2.2x         |
| DOTPRODUCT | 8.0x     | 6.5x         |

Overall speedup on ISV application suite = 1.3x (estimate)

Note: C code is compiled with +O2 +Onoparmsoverlap +Odataprefetch

# Performance Tuning Summary

- We estimate that we will improve this ISV application's performance on IA-64 platforms by 30%
- We will work closely with the ISV R&D team to ensure that the ISV's customers will enjoy performance improvements on HP platforms in the near future

# Backup Slides

### Where we're going

- IA-64
  - An EPIC story, years in the making
  - HP and Intel jointly designed instruction set
  - Now it can be told
- Intel IA-64 home page: http://developer.intel.com/design/ia-64
- Recommended articles:
  - 'Next Generation Instruction Set Architecture' (Crawford, Huck): http://developer.intel.com/design/ia-64/next/index.htm
  - 'Itanium Processor Microarchitecture Overview' (Sharangpani): http://developer.intel.com/design/ia-64/microarch\_ovw
  - The complete (>500 pages) 4 volume 'The IA-64 Architecture Software Developer's Manual': http://developer.intel.com/design/ia-64/manuals

### Chip production costs per year

How many proprietary RISC vendors can continue to invest? (...or, alternatively, face higher chip costs for low volumes with a third party fab?)

## Microprocessor Production Capacity

Especially when fabrication and design costs must be recouped against relatively small unit volumes compared with merchants...
### Where we've been

| Processor families             | CISC (Complex Instruction Set Computing) | Vector   | RISC (Reduced Instruction Set Computing) | LIW (Long Instruction Word). Example: EPIC - Explicitly Parallel Instruction Computing |
|--------------------------------|------------------------------------------|----------|------------------------------------------|------------------------------------------------------------------------------------------|
| Architecture (Instruction Set) | IA-32                                    | C-Series | PA-RISC, MIPS, Alpha                     | IA-64                                                                                    |
| Processors                     | Pentium III Xeon                         | C-4      | PA-8500, R12000, Alpha 21264             | Itanium, McKinley                                                                        |

#### IA-64 Public Information

#### • Itanium

- Multiple configurations for servers and w/s
- Production in mid-2000
- 0.18 µm CMOS technology
- 4 DP Flops/cycle, 3 Gflop/s peak
- Three-level cache hierarchy (64-byte line size)
  - L0: separate instruction and data
  - L1: unified cache on die
  - L2: off die, 2 or 4 MB

#### • McKinley

- Clock > 1 GHz, increased number of execution units, on-die L2 cache
- Increased bus bandwidth
- Target production: late 01

#### • Madison

- McKinley follow-on
- Performance optimized on 0.13 µm technology

#### • Deerfield

- McKinley follow-on
- Price/performance optimized on 0.13 µm technology