The 64-Bit technology introduces
several new and complex tasks for software-developers. Even that the hardware
developing industry claims that future software development should take care
about the new introduced compiler systems, it is necessary to have a deep
inside view on how the new underlying 64-Bit assembly language works. This
paper describes what 64-Bit means for future software developments, how 64-Bit
influences assembly programming and how to port applications programmed under
32-Bit to 64-Bit. It is heavy based on the IA architecture and the Windows® operating
system.
Introduction
The
introduction of the 64-Bit technology leads into several discussions if this
change is really necessary [1]. Some arguments are to speed up
computation time or to make calculations more precise. These efforts lead into
hard problems which have to be solved by upcoming compilers or operating systems.
Especially the necessary change of the operating system infrastructure causes
heavy problem porting older 32-Bit applications to 64-Bit compatibility.
Examples for enhanced operating system detection have been described by Kruse [2]. Since assembly programming is a
comfortable task under 32-Bit systems (e.g. MASM, TASM, NASM, FASM and others),
the question stays on how to port software and which changes of the architecture
affects the current programming techniques.
The
64-Bit Architecture
The new 64-Bit Architecture
introduces many innovations. Some of them are described by Jarp [3]:
- Rich Instruction Set
- Bundled Execution
- Predicated Instructions
- Large Register Files
- Register Stack
- Rotating Registers
- Modulo Scheduled Loops
- Control/Data Speculation
- Cache Control Instructions
- High-precision Floating-Point
Compared to
the IA-32 architecture the new IA-64 architecture allows several advantages for
the programmer. One of them is the clear and explicit programming technique
which results in EPIC (Explicit Parallel Instruction Computing). EPIC is a
Intel® IA-64 technique. Additional the programmer can code now more
register-based, which means that everything is kept in registers as long as
possible. The resulting assembly is code has a clear structure:
.align 32
.section .text
.proc demo
demo:: ld4 r32 =
[r32]
mov r55 = r24
add r44 = 32, r10
mov r17 = 3
;;
j_loop:
cmp.le p15,p0 = r32,r0
add r32 = -1,r32
;;
(p15) br.cond.spnt.few j_loop_end
;;
add r55 = r44, r43
fadd f6 = f4, f6
nop.m 0
(p17) ld8 r40 = [r50]
;;
j_loop_end:
(p16) add r50 =
r40,r0
nop.f 0
br.ctop.sptk.few j_loop
;;
_done:
br.ret.sptk.few b0
;;
.endp
It is
noticeable that Intel® offers SSE3 with the introduction of the Prescott processor. These
instructions are included therefore in the "real" 64-Bit processors
of Intel® too (this information has been given by IDF 2004, San Francisco [4]). From this integration it is clear;
that Prescott
is nothing else than a 64-Bit processor which includes all remaining support
but has not unlocked this support.
The AMD
x86-64 differs from the IA-64 structure in important details (See figure 1).
The developer has...
“...access to eight additional GPRs, for a
total of 16 GPRs. Furthermore, there are eight new SIMD registers, added for
use in SSE/SSE2 code. So the number of GPRs and SIMD registers available to
x86-64 programmers will go from eight each to sixteen each [1].”
The IA-64
Registers
64-Bit
simply designate the number of bits that each of the processor's general-purpose
registers (GPRs) can hold. This affects the number of instructions on a 64-Bit
processor too.
In general
the registers of a 64-Bit CPU are twice as wide as those in the 32-Bit CPU. As
difference the instruction register (IR) size has the same size for 32-Bit and
for 64-Bit processors [5-7]. So using of 64-Bit integer for
example gives additional place to store data. A 64-bit architecture can
(theoretically) address up to 18 million terabytes. Unfortunately this can lead
into wasting memory. Such problems have to be optimized in the future and are a
problem of future compiler construction.
As one
resulting example Win64 Pointers are eight bytes long, and not 4 as defined by
Win32 systems. Another example is that the size of LRESULT, WPARAM and LPARAM
will all expand to 64 bits, so message handlers will have to be checked for
inappropriate casts [8]. Under the Intel® IA-64 structure
the numbers of available registers have exploded. The assembly programmer
should ignore most of them and can focus on a few sets of them.
To begin
with, the IA-64 has 128 general-purpose registers, each 64 bits wide [9]. These registers are conceptually
similar to the general-purpose registers such as EAX on the x86. The IA-64
general-purpose registers are named with an r, followed by the register number.
Thus, r0 is the first general-purpose register, and r127 is the last
general-purpose register. The first 32 general-purpose registers (r0-r31) are
static. That is, any code that refers to one of these registers will always be
referring to the exact same register in silicon. All of the x86 registers act
this way, and thus can be considered static. Some of the static registers have
predefined meanings, and are usually referred to some other way than their r
name. The global pointer and the stack pointer are two of the most important
registers that fit this category. The r12 register is used as the stack pointer
and is thus called the sp register. The r1 register is the global pointer, and
you'll see it referred to as the "gp" register.
In addition
to the 32 static general-purpose registers, the IA-64 also has 96 dynamic
general-purpose registers. Dynamic means that a given register name doesn't
always refer to the exact same physical register on the CPU. That is, a
register such as r34 in one function is likely to be assigned to a completely
different physical register than r34 in another function.
What's the
point of dynamic registers? Everything about the IA-64 is focused on speed.
Keeping values in registers and out of memory is one way to keep things running
quickly. If a parameter is passed on the stack, and if that stack location
isn't in the memory cache, the CPU might waste dozens of clock cycles just to
read the parameter from the slow main memory. In contrast, registers can always
be accessed in a single clock cycle [9].
The dynamic
registers exist so that each function can have its own set of up to 96
registers to work with. Inside a function, registers r32 through r127 are all
essentially reserved for just that function. Hopefully, all of the function's
local variables and parameters can be stored in these 96 registers. Of course,
a function may not need all 96 registers, but they are available nonetheless.
In addition to the 128 general-purpose registers, the IA-64 also has 128
floating-point registers. They're named f0 through f127 (somebody probably gets
paid a lot of money to come up with these names). The floating-point registers
are 82 bits in length, allowing them to hold up to a C++ long double. As with
the integer registers, certain floating-point registers have predefined
meanings. For instance, the f0 register is always set to 0.0, while the f1
register always holds the value 1.0. The last set of registers you need to know
here are the branch registers. The IA-64 defines eight branch registers, named
b0 through b7. These are 64-bit registers that contain the address of a code
location that the CPU can transfer control to. On the IA-64, all control
transfers take the form of a branch. The br.call instruction is equivalent to
the x86 CALL; the br.ret instruction is like the x86 RET; and a simple br
instruction is like an x86 JMP.
An example
on how to do Parameter Passing on the IA-64 has been described by Pietrek in
another article [10]. An inside view on the Intel®
64-Bit architecture has been described in detail by Jarp [3].
~
registers assembly
indirect
mnemonic access
---------------------------------------------
application ar n
branch b n
control cr n
CPU
identification cpuid y
data
breakpoint dbr y
instriction
breakpoint ibr y
data TLB
translation dtr y
floating
point f n
general r n
instruction
TLB translation itr y
protection
key pkr y
performance
monitor config pmc y
performance
monitor data pmd y
predicate p n
region rr y
Table 1: Example for understanding
the new registers.
AR0-AR7 equ AR.KR0-AR.KR7 kernel registers
AR8-AR15 RESERVED
AR16 equ AR.RSC register stack
configuration
AR17 equ AR.BSP backing store pointer
(read only)
AR18 equ AR.BSPRESTORE backing store pointer
mem stores
AR19 equ AR.RNAT RSE NaT collection
AR20 RESERVED
AR21 equ AR.FCR IA-32 floating-point
control
AR22,AR23 RESERVED
AR24 equ AR.EFLAG IA-32 EFLAG
AR25 equ AR.CSD IA-32 code segment
descriptor ||
compare
and store data
AR26 equ AR.SSD IA-32 stack segment
descriptor
AR27 equ AR.CFLG IA-32 combined CR0 and CR4
AR28 equ AR.FSR IA-32 floating point status
AR29 equ AR.FIR IA-32 floating point instr.
AR30 equ AR.FDR IA-32 floating point data
AR31 RESERVED
AR32 equ AR.CCV compare and exchange
compare
value
AR33-AR35 RESERVED
AR36 equ AR.UNAT user NaT collection
AR37-AR39 RESERVED
AR40 equ AR.FPSR floatint point status
AR41-AR43 RESERVED
AR44 equ AR.ITC interval time counter
AR45-AR63 RESERVED / AR48- IGNORED
AR64 equ AR.PFS previous function state
AR65 equ AR.LC loop count
AR66 equ AR.EC epilog count
AR67-AR127 RESERVED / AR112- IGNORED
Table 2: Application register
example: thinking with names.
Porting
Applications for Win64
Since there
are hard changes between the architectures of 32-Bit and 64-Bit there evolve
several problems for software developers in porting their software. We put our
focus here on the new Win64 operating system. The Microsoft platform SDK ships
with a pre-beta release of an IA-64 compiler, which allows it to test, and compile
code for Win64.
Therefore Win64 offers 4 different
porting options for existing applications:
- 64-bit Full Port
- Small Address Space with 64-bit
Pointers
- Small Address Space with 32-bit
Pointers
- Win32 Application
A detailed
description of all 4 options including all necessary steps for porting
applications from Win32 to Win64 are described by Intel® Corporation [11]:
Win64 uses
the LLP64 (or P64) uniform data model in which pointers and long are 64 bits,
but int and long are 32 bits. To support this data model and to provide
compatibility with the existing Win32 code, new data types have been defined in
Win64. Some of them are fixed precision data types, for instance INT32 or
INT64. Others change size depending on the architecture, for instance INT_PTR.
In general
simply compiling a Win32 application as Win64 application can cause several
compiler errors. To prevent such behaviour it is necessary to change for
example the Win32 API Calls from:
LONG iVal
= GetWindowLong(hWnd, GWL_HINSTANCE);
to
LONG_PTR
iVal = GetWindowLongPtr(hWnd, GWLP_HINSTANCE);
The problem
is here that four of the existing Win32 APIs that are used to set or get
polymorphic window class data items have been changed [11]:
- Get/SetClassLong changed to
Get/SetClassLongPtr
- Get/SetWindowLong changed to
Get/SetWindowLongPtr
Concerning all those problems it is
necessary to adapt the source code in a massive way:
#if
defined(_WIN32)
… // stuff
related to Win32
#if
!defined(_WIN64)
… // Win32
without Win64 (regular Win32)
#else /*
is _WIN64 also */
… // Win64
variant of Win32
#endif /*
_WIN64 ? */
#elif
defined(__unix) || …
… //
various UNIXes
#else /*
some other OS */
#error
Unhandled OS;
#endif
Problems and solutions for adapting
source code have been given by Chen [12].
Programming
and Assembly under 64-Bit
One
important change for the 64-Bit programmer is the Global Pointer and it's support code which has no equivalent on
32-Bit systems [13]. The Global Pointer is a
preassigned value which makes it possible to access data within a load module.
On the IA64, each instruction is 41 bits in length. The Global Pointer concept
has been described by Pietrek [13].
Assembly
Optimization
Optimizing
assembly code for 64-Bit has been described by Fletcher [14] using the example of the bubble
sort algorithm:
void
bubblesort(int count, int array[]) {
for(int pass = 1; pass < count; pass++) {
for(int i = 0; i < count-1; i++) {
if(array[i] > array[i+1]) {
int hold = array[i];
array[i] = array[i+1];
array[i+1] = hold;
}
}
}
}
After
cleaning up some name mangling the Intel® C++ Compiler produces the following
assembly code [14]:
.proc
bubblesort
bubblesort:
{ .mii
cmp4.ge.unc
p9,p0=1,r32 //0: 20 7
mov
r20=ar.lc //0: 19 2
add
r19=1,r0 //0: 20 6
}
{ .mmb
add
r16=4,r33 //0: 21 31
add
r18=-1,r32 //0: 21 32
(p9)
br.cond.dpnt .b1_1 ;; //0: 20 8
// Block
2: lentry Pred: 0 3 Succ: 3 7
// Freq
5.0e+001, Prob 0.50
}
.b1_2:
{ .mii
add
r19=1,r19 //0: 20 15
cmp4.ge.unc
p8,p0=r0,r18 //0: 21 13
sxt4
r17=r18 //0: 21 27
}
{ .mmb
mov r9=r16
//0: 20 12
mov r3=r33
//0: 20 11
(p8)
br.cond.dpnt .b1_3 ;; //0: 21 14
// Block
7: prolog Pred: 2
// Succ:
17 Freq 5.0e+002, Prob 1.00
}
{ .mmi
add
r11=-1,r17 ;; //0: 0 28
nop.m 0
sxt4
r10=r11 ;; //1: 0 29
}
{ .mii
nop.m 0
mov
ar.lc=r10 //2: 0 30
nop.i 0 ;;
// Block
17: lentry lexit ltail collapsed Pred: 7 17
// Succ: 17 3 Freq
2.5e+003, Prob 0.80
}
.b1_17:
{ .mmi
ld4 r8=[r9] //0: 22 46
ld4 r2=[r3] //0: 22 45
nop.i 0 ;;
}
{ .mii
cmp4.le.unc p0,p6=r2,r8
//1: 22 54
nop.i 0
nop.i 0 ;;
}
{ .mmi
(p6) st4
[r3]=r8 //2: 24 52
(p6) st4
[r9]=r2 //2: 25 53
add
r9=4,r9 //2: 21 50
}
{ .mib
nop.m 0
add
r3=4,r3 //2: 21 49
br.cloop.sptk
.b1_17 ;; //2: 21 51
// Block
3: lexit epilog ltail Pred: 2 17 Succ: 2 1
// Freq
5.0e+001, Prob 0.80
}
.b1_3:
{ .mib
cmp4.lt.unc
p7,p0=r19,r32 //0: 20 16
nop.i 0
(p7)
br.cond.dptk .b1_2 ;; //0: 20 17
// Block
1: exit epilog Pred: 0 3 Succ:
// Freq 1.0e+000, Prob
1.00
}
.b1_1:
{ .mib
nop.m 0
mov ar.lc=r20 //0: 29 9
br.ret.sptk.many b0 ;;
//0: 29 10
}
.endp
Optimizing
the code above needs several steps. One of such steps is eliminating
instructions where three instructions can be eliminated when 2 assumptions can
be made. For example
cmp4.ge.unc
p9,p0=1,r32 //0: 20 7
(p9)
br.cond.dpnt .b1_1 ;; //0: 20 8
mov
r20=ar.lc //0: 19 2
can be optimized to
add
r19=1,r0 //0: 20 6
add
r16=4,r33 //0: 21 31
add
r18=-1,r32 //0: 21 32
Using a bundle the above code results in the
following template. It is important here that $m$ and $i$ is used for the bundle template.
{ .mmi
add
r19=1,r0 //0: 20 6
add
r16=4,r33 //0: 21 31
add r18=-1,r32
;; //0: 21 32
}
One of the
next steps should be the finding of outer loops which is described by Fletcher [14] in detail. A explicit detailed
description of the assembly language for the Itanium architecture is provided
by Intel® Corporation [15-17].
Problems of IA-64
Since the
IA-64 is not using an absolute addressing mode the global variables are
accessed through the $r1$ register which synonym is the Global Pointer. The addl instruction provided by the IA-64
architecture allows only adding constants up to 22 bit. This results in a
maximum size of 4 MB of global variables [18]. For solving this problem the
Intel® C++ Compiler optimizes the code. The IA-64 compiler solves this problem
by splitting global variables into two categories, "small" and
"large" where the difference between small and large can be set via
compiler settings. This results into a difference how assembly code looks like.
The small variable would look like:
addl r30 = -205584, gp;;
// r30 -> global variable
ld4 r30 = [r30]
// load a DWORD from the global
variable
whereas the
large variant result into the following larger code snippet where the variable
itself is allocated in a separate section of the file and a pointer to the
variable is placed into the "small" globals variables section of the
module:
addl r30 = -205584, gp;;
// r30 -> global variable forwarder
ld8 r30 = [r30];;
// r30 -> global variable
ld4 r30 = [r30]
// load a DWORD from the global variable
Such
difference between small and large global variables needs a explicit definition
of variables even in C++ code. Using extern BYTE b[]; will results in that the compiler uses the
large global variables concept to be on the save side. This results in larger
compiled code and possible inefficiency of the code. Additionally the
programmer should take care if he defines a short global variable when it's is
large. The programmer has to take care about this difference when using same
definitions within different headers. Using extern BYTE b; in one file and extern
BYTE b[256]; in another
one will result for sure in some confusion.
Another
problem has been described by Chen [19]. The IA-64 introduces another
possible bad consequence of a mismatched function signature.
One example
of such a bogus code is [20-22]:
void
MyCritSectProc(LPVOID /*nParameter*/)
{ ... }
hMyThread
= CreateThread(NULL, 0,
(LPTHREAD_START_ROUTINE)
MyCritSectProc,
NULL, 0, &MyThreadID);
Another
example by Microsoft® MSDN misdeclares both: the return value and the input
parameter. The result is a crash on Win64 [23]. More examples are described by
Chen [19].
Conclusions
and Future Work
64-Bit is a
necessary evolution in processor architecture. Nevertheless the problems of
porting software from 32-Bit to 64-Bit will cause problems under Win64. As well
the differences between the Intel® and the AMD® Architectures seem to result in
software adoption problems which have been described before in the mid eighties
of the last century. It seems to me more important than before to stay in touch
with assembly code since the different approaches of compilers and vendors
produce not always good and reliable code. Future work should focus on detailed
compiler analysis to pinpoint the absolute differences in resulting assembly
code and in evaluation of the performance of the different compiler dependant
assembly codes which has been done by Loughran before for Win32 environments [24].
References
1. Stokes J: An Introduction to 64-bit Computing and
x86-64. ARS Technica 2004.
2. Kruse
T: Processless Applications -
Remotethreads on Microsoft Windows« 2000, XP and 2003. CodeBreakers Journal 2004, 1(1).
3. Jarp
S: A Detailed Tutorial of the IA-64
architecture. Web 1999.
4. Corporation
I: IDF 2004, San Francisco. In: 2004; 2004.
5. Stokes
J: Understanding the Microprocessor -
Part I: Basic Computing Concepts. ARS
Technica 2004.
6. Stokes
J: The Pentium 4 and the G4e: an
Architectural Comparison. ARS
Technica 2004.
7. Stokes
J: Understanding the Microprocessor -
Part II: Pipelining and Superscalar Execution. ARS Technica 2004.
8. Loughran
S: Frequently Asked Questions about
Windows Programming. ISERAN 2004.
9. Pietrek
M: IA-64 Registers. MSDN Magazine 2001.
10. Pietrek
M: IA-64 Registers, Part 2. MSDN Magazine 2001.
11. Corporation
I: Quick Start Guide for Porting
Applications for Win64. 2004.
12. Chen
B: Intel Itanium Intel ItaniumÖ Porting
Methodologies. Intel China 2004.
13. Pietrek
M: Programming for 64-bit Windows. MSDN Magazine 2000.
14. Fletcher
JA: Optimizing Itanium Processor Family
Assembly Code. Intel Corporation 2004.
15. Intel:
Intel Itanium Architecture Software
Developer's Manual - Volume 1: Application Architecture, Revision 2.1.
2004.
16. Intel:
Intel Itanium Architecture Software
Developer's Manual - Volume 1: System
Architecture, Revision 2.1. 2004.
17. Intel:
Intel Itanium Architecture Software
Developer's Manual - Volume 1: Instruction Set Reference, Revision 2.1.
2004.
18. Chen
R: ia64 - misdeclaring near and far data.
MSDN 2004.
19. Chen
R: Uninitialized garbage on ia64 can be
deadly. MSDN 2004.
20. Clipcode:
Snippets for the Clipcode Systems SDK -
Synchronisation. ARS Technica 2004.
21. Wild
C: Single Thread/Process Global Variable.
2004.
22. Devi
M: Mutex. Long Island University 2004.
23. Asche
RR: Putting DLDETECT to Work. MSDN 2004.
24. Loughran
S: Code for Speed - Writing High
Performance Win32 Code. ISERAN 2004.
|