Tips on Saving Bytes in ASM Programs
Written by Larry Hammick   


The programmer's word for craftsmanship is "optimization". The term refers to conservation, either of program size or of execution time. Execution time includes not just CPU clocks, but also the time consumed by peripherals (e.g. disks, at load time) and by operating system calls. This article is concerned with the conservation of size, that is, of bytes. Size may refer either to the size of the program file or to the amount of memory the program uses. The two are not always identical.

In all the illustrations, we assume that 16-bit code segments are involved. The syntax we use is that of MASM 5.1; the difference from other assemblers is slight.

 

 

1. Avoid uninitialized data.

A declaration like this:

    OutputHandle    dw  ?

is usually a waste of space. Depending on the memory model (i.e. depending on whether we have CS=DS, and the like), there are several ways to omit these two bytes from the program file and the memory image.

If DS is the PSP segment, use:

    OutputHandle    equ     word ptr ds:[5Ch]

or similar, for a value other than 5Ch. Any program may safely use any part of the PSP from 3Ch to 07Fh, plus the word at 2Ch (environment segment). When the program is finished with the command tail (bytes 80h-0FFh), it can reuse that area as well. Other parts of the PSP should not be modified, because they may be needed by DOS when the program exits. However, in the case of a TSR, the stay-resident part of the code (e.g. an interrupt handler) may use any part of the PSP after the TSR exit has been executed. In such cases, the PSP makes a handy buffer of 100h bytes with ORG 0.

If DS=CS, you can define uninitialized variables like this:

    OutputHandle    dw  ?
    InputHandle     dw  ?
                    ORG OutputHandle
    Go:             mov ah,30h
                    int 21h
                    ...
                    mov OutputHandle,ax
                    ...
                    END Go

or, equivalently:

    OutputHandle    equ     word ptr ds:[Go]
    InputHandle     equ     word ptr ds:[Go+2]

If DS is a dynamically allocated segment, or if it is part of the stack, there is this trick:

    OutputHandle    equ     word ptr ds:[0]
    InputHandle     equ     word ptr ds:[2]

Allocating file and memory space just for uninitialized variables wastes a few bytes here and there. Much worse, for file size, is to put whole buffers and stacks in the file:

    ReadBuffer      db  1000h   dup (0)
    Stack           db  40h     dup ("--Stack!--")

Examine a few commercial programs under a hex editor or debugger to see how common this practice is. Worldwide, the quantity of disk space thus wasted must be astronomical. Moreover, such "data" gets copied from disk every time the program is loaded, even though it has no meaning! Perhaps assemblers and linkers will someday be smart enough to avoid this. For now, we do have EXE packers such as PKLite to compress blank data blocks, but the latter can be avoided entirely as follows.

If DS is a dynamic segment or part of the stack:

    BufferSize      equ     1000h
    ReadBuffer      equ     0
    WriteBuffer     equ     ReadBuffer+BufferSize
                    ...
                    mov dx,ReadBuffer   ;rather than mov dx,offset ReadBuffer
                    mov ah,3Fh
                    int 21h
                    ...

If the program will be small enough for the code and all data to fit in one segment, it is desirable to have CS=DS. Then you can do:

    ReadBuffer      equ     offset EndOfCode
    WriteBuffer     equ     ReadBuffer+BufferSize
    Go:
                    ...                 ;code instructions
                    mov ah,4Ch
                    int 21h             ;exit
    EndOfCode       label byte
                    END Go

This practice is not quite safe for a COM program, because DOS will load a COM file into less than 64K if no larger block is available or if memory is fragmented. For an EXE, the EXE header can be adjusted to prevent the program from loading into too little memory.
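A COM program can guard against this at startup. The following is only a sketch, assuming the usual COM entry conditions (CS=DS=PSP segment) and relying on the word at PSP offset 2, which holds the segment just past the memory block DOS allocated to the program. NeededParas and NotEnoughMemory are hypothetical names, and the value 400h is an arbitrary illustration.

```asm
NeededParas     equ     400h            ;hypothetical: paragraphs needed for code, buffers, and stack

Go:             mov ax,word ptr ds:[2]  ;PSP:2 = segment just past our memory block
                mov bx,cs               ;in a COM program, CS = PSP segment
                sub ax,bx               ;AX = paragraphs actually available
                cmp ax,NeededParas
                jb  NotEnoughMemory     ;hypothetical label: report the error and exit
                ...                     ;safe to use the buffers past EndOfCode
```

The check costs about a dozen bytes, far less than the buffers it makes it safe to leave out of the file.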

2. Put related data together.


An example:

    CursorPosition  label   word
    CursorColumn    db      0Eh
    CursorRow       db      8

You will be able to load or save both variables with one instruction:

mov dx,CursorPosition

Another benefit:

        and CursorPosition,0FF00h
        jnz NotAtTop

The AND instruction clears one byte (the column) and tests the other (the row) at the same time.

3. Avoid forward references.


Forward references in source can result in worthless NOPs getting assembled. This is another illustration of the principle that assemblers are pretty dumb.
Consider

            mov cx,MsgSize          ;(1)
            ...
    Msg     db      "Hello",0Dh,0Ah
    MsgSize equ     $-offset Msg

MsgSize is a constant word. But MASM doesn't know that when it first assembles instruction (1). So it reserves room for a longer form of the instruction, and later fills in the 3-byte MOV CX,immed followed by a NOP byte. One solution:

            db      0B9h            ;opcode for mov cx,immed
            dw      MsgSize
            ...
    Msg     db      "Hello",0Dh,0Ah
    MsgSize equ     $-offset Msg

4. Use cheap opcodes.


4.1 XCHG AX,Reg16
These 8 instructions are each just 1 byte. Don't use either MOV AX,CX or MOV CX,AX unless you need the same value in both registers. AX is special in this respect; instructions such as XCHG BX,CX or XCHG SI,DI are two bytes. XCHG EAX,Reg32 is two bytes (in 16-bit code segments), whereas MOV EAX,ECX etc. is three.
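To make the counts concrete, here is a short listing with the machine code in the comments. It is an illustrative sketch: where an instruction has two equivalent encodings, only one is shown, and an assembler may pick the other (the length is the same). The 32-bit forms assume a .386 directive is in effect.

```asm
        xchg ax,cx          ;91          1 byte
        mov  ax,cx          ;8B C1       2 bytes
        xchg bx,cx          ;87 CB       2 bytes
        xchg eax,ecx        ;66 91       2 bytes (operand-size prefix)
        mov  eax,ecx        ;66 8B C1    3 bytes
```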

4.2 CBW, CWD, CDQ
To put AH=0, the instructions

        xor ah,ah
        sub ah,ah
        mov ah,0

occupy two bytes each. But if you know that AL < 80h, the instruction CBW has the same effect (except that it leaves the flags unchanged) and is only one byte. Likewise, when AX < 8000h, CWD can save a byte over XOR DX,DX. CDQ is a 2-byte opcode (in 16-bit code segments) but still better than XOR EDX,EDX, which is 3 bytes; it requires EAX < 80000000h.

4.3 JCXZ
This instruction does not require a preliminary flag-setting instruction. So, you might prefer

        xchg ax,cx
        jcxz MyLabel

to

        or ax,ax
        jz MyLabel

saving one byte. Be aware that JCXZ is a relatively slow opcode.

4.4 INC Reg16 and DEC Reg16
These 16 opcodes are just one byte each. The opcodes INC Reg8 and DEC Reg8 are 2-byte. So use INC CX instead of INC CL if there is no possibility of carry from CL into CH. If CX is known to be 0, INC CX saves a byte vs. MOV CL,1, and 2 bytes vs. MOV CX,1. Similar tricks apply to going from -1 to 0, to decrementing from 1 to 0 or from 0 to -1.
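The byte counts behind these claims, as a small annotated sketch (machine code in the comments):

```asm
        inc cx              ;41          1 byte
        inc cl              ;FE C1       2 bytes
        mov cl,1            ;B1 01       2 bytes
        mov cx,1            ;B9 01 00    3 bytes
        dec cx              ;49          1 byte, e.g. to go from 1 to 0 or from 0 to -1
```

Note that INC and DEC alter the flags (other than carry), whereas MOV does not, so the substitution is only safe where the flags are not needed afterwards.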

4.5 Prefer the accumulator to other registers. The following opcodes, among others, are cheaper for AX or EAX than for other general registers.

        MOV reg,mem
        MOV mem,reg
        ADD reg,immed
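A sketch of the difference, with the machine code in the comments. MyWord is a hypothetical word variable at a direct address; the immediate 1234h is chosen because it does not fit in a sign-extended byte, a case in which the assembler can use a shorter form for any register.

```asm
        mov ax,MyWord       ;A1 lo hi       3 bytes
        mov bx,MyWord       ;8B 1E lo hi    4 bytes
        mov MyWord,ax       ;A3 lo hi       3 bytes
        mov MyWord,bx       ;89 1E lo hi    4 bytes
        add ax,1234h        ;05 34 12       3 bytes
        add bx,1234h        ;81 C3 34 12    4 bytes
        cmp al,5            ;3C 05          2 bytes
        cmp bl,5            ;80 FB 05       3 bytes
```

The short accumulator forms exist for immediate operands of the common arithmetic and logical instructions and for MOV to or from a direct memory address.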

5. Be flexible on flow control.


Block-structuring is very sensible in high-level languages, but in ASM it is little more than a pedantic habit. In ASM, a routine may have more than one entry point and more than one exit (RETN, RETF, or IRET). Several routines may share exit code or entry code. A routine need not return at all. A few examples of how this can save bytes:

5.1 Discard return addresses that won't be needed. This sort of thing appears often:

    MySub:  cmp al,3
            ja StcRet
            ...
            ret
    StcRet: stc
            ret
            ...
            call MySub
            jc Ret1
            ...
    Ret1:   ret

Better is:

    MySub:   cmp al,3
             ja DontRet
             ...
             ret
    DontRet: pop ax         ;discard return address into some unneeded register
             ret            ;return to the caller's caller
             ...
             call MySub     ;returns only if input is okay
             ...

5.2 Reuse exit code.
If you see this more than once in your source:

        pop bx
        pop dx
        pop ax
        retn

make a label at POP BX, and use a jump to that label from each other occurrence. If this happens more than once:

        push ax
        push cx
        push dx
        push bx

consider a subroutine:

    SaveRegs: pop si        ;store return address in an unneeded register
              push ax
              push cx
              push dx
              push bx
              jmp si
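A sketch of how such a subroutine might be used together with shared exit code. MySub1, MySub2, and RestoreRegs are hypothetical names. Each near CALL is 3 bytes against 4 bytes for the four pushes, so every call site saves one byte; the subroutine itself costs 7 bytes, so it pays off only when it is called from many places or saves more registers than shown here.

```asm
MySub1:      call SaveRegs          ;3 bytes instead of four 1-byte pushes; clobbers SI
             ...
             jmp short RestoreRegs  ;share the exit code
MySub2:      call SaveRegs
             ...                    ;falls through into the shared exit
RestoreRegs: pop bx                 ;pop in the reverse of the push order
             pop dx
             pop cx
             pop ax
             ret
```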

5.3. Consider CALL instead of JMP.
The CALL instruction can be used instead of JMP to pass a near address at almost no cost.

                 mov ah,30h
                 int 21h
                 cmp al,5
                 jae EnoughDOS
                 call ErrExit
                 db "This program requires DOS 5+",13,10,0
    EnoughDOS:
                 ...
    ErrExit:     pop si         ;"Return address" actually points at data.
    ErrExitLoop: lodsb
                 or al,al
                 jz Exit
                 int 29h
                 jmp short ErrExitLoop
    Exit:        mov ax,4CFFh
                 int 21h

In the above example, the routine ErrExit writes an ASCIIZ string from CS:SI, then exits.

The offset of a jump table can sometimes be passed in the same way.

                   call SmartJump   ;does not return
                   db      3
                   dw      Handle3  ;Handle3 and Handle7 are near code addresses
                   db      7
                   dw      Handle7
                   db      0        ;terminator for the table
    SmartJump:                      ;input is a jump table index AL.
                   pop di           ;"return address" actually points at the jump table
    SmartJumpLoop: cmp byte ptr[di],0
                   je NotFound
                   scasb
                   je Found
                   scasw            ;cheaper than incrementing di twice
                   jmp short SmartJumpLoop
    Found:         jmp word ptr es:[di]
    NotFound:      ...

The above example assumes DS=ES=CS and that the direction flag is clear.

5.4 Short jumps are cheaper than near jumps. You can often save a few bytes by arranging your source so that jumps are short rather than near.

If this occurs:

            cmp al,5
            jne Not5
            jmp CantRun
    Not5:
            ...
            jmp CantRun
            ...

and CantRun is not reachable by a short jump in either instance, you might still save a byte like so:

                cmp al,5
                jne Not5
    JmpCantRun: jmp CantRun
    Not5:
                ...
                jmp short JmpCantRun    ;2-step jump
                ...

6. Registers are cheaper than constants.


You should never write this (6 bytes):

        mov si,StringSite       ;a 16-bit constant
        mov di,StringSite

Instead (5 bytes):

        mov si,StringSite
        mov di,si

Another illustration:

    MyByte  db 11h
            ...
            mov MyByte,0        ;a 5-byte instruction
            mov MyByte,bh       ;4 bytes, and equivalent if bh is known to be 0
            mov MyByte,al       ;only 3 bytes, and equivalent if al is known to be 0

7. Code can be used as data.


Here are two examples of a slick technique known as self-modifying code.

    ErrExit:    call WriteMessage
                db   0B8h       ;code for MOV AX,Immed16
    ReturnCode  db   ?,4Ch
                int 21h         ;exit from program

The label ErrExit can be reached by JMP's from several points in the program. Before jumping, the code pokes in a suitable value of ReturnCode, depending on the type of error condition encountered. The above example uses part of the instruction MOV AX,4Cxxh as a variable, saving bytes.

                  mov ax,352Fh        ;get INT 2Fh vector as ES:BX
                  int 21h
                  mov OldInt2F,bx     ;this example assumes CS=DS at this point
                  mov OldInt2F[2],es
                  mov dx,offset OurInt2F
                  mov ax,252Fh        ;set INT 2Fh vector to DS:DX
                  ...
    OurInt2F:     cmp ax,1211h        ;a function that we want to control
                  jne short JmpOldInt2F
                  ...                 ;(handle this function)
                  iret
    JmpOldInt2F:  db     0EAh         ;opcode for jump to immediate far address
    OldInt2F      dw     ?,?

This manoeuvre saves bytes versus JMP DWORD PTR OldInt2F; again, the method is by putting the variable (OldInt2F) right inside the code. Device drivers and other TSR's should use this trick, but I don't know of a single one which does (except my own, naturally).

Safe use of self-modifying code requires some awareness of on-chip instruction caches. It's no good to modify code in memory if what will get executed is already on the CPU. The following trick, however, is quite safe. Instead of:

    ErrExit2: mov al,2
              jmp short ErrExit
    ErrExit3: mov al,3
              jmp short ErrExit
    ErrExit5: mov al,5
    ErrExit:  mov ah,4Ch
              int 21h

write

    ErrExit2: mov al,2
              db 3Dh        ;opcode for CMP AX,immed, to disable the following 2-byte instruction
    ErrExit3: mov al,3
              db 3Dh
    ErrExit5: mov al,5
    ErrExit:  mov ah,4Ch
              int 21h

8. Miscellaneous byte-savers.


Since the instruction sets of the x86 CPU's are so elaborate, there are many more ad hoc ways to reduce, reuse, and recycle bytes. The following are only a few.

8.1 After a loop, CX is 0. Thus

            mov cx,1234h
    MyLoop: ...
            ...
            loop MyLoop
            mov cx,56h
            ...

is wasteful. The last instruction should be

            mov cl,56h

8.2 Use conditional MOV's.

                    cmp VideoMode,7
                    je BlackAndWhite
                    mov dx,0B800h
                    jmp short Either
    BlackAndWhite:  mov dx,0B000h
    Either:
                    ...

The above code wastes bytes. Better is:

                    mov dx,0B800h
                    cmp VideoMode,7
                    jne GotVideoBase
                    mov dh,0B0h
    GotVideoBase:
                    ...

The improved version has one jump instruction instead of two, and in this example saves an additional byte by resetting only DH, not DX.

With the Pentium Pro, Intel introduced a useful set of conditional moves (CMOVcc) right into the instruction set.

8.3 To test the high bit of a register, avoid the constants 80h and 8000h. For example,

        test dh,80h
        jnz MyLabel

is 5 bytes, but

        or dh,dh
        js MyLabel

is 4. The latter sequence also leaves more information in the flags. TEST DH,DH and AND DH,DH have the same effect as OR DH,DH.

8.4 To determine if several variables of the same size are all 0, OR them together, and the zero flag will tell you. To determine if they are all -1, AND them together and increment the result.
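Both tests can be sketched as follows. Var1, Var2, and Var3 are hypothetical word variables, and AllZero and AllMinusOne are hypothetical labels; AX is used as scratch.

```asm
        mov ax,Var1
        or  ax,Var2
        or  ax,Var3
        jz  AllZero         ;ZF is set only if every variable is 0

        mov ax,Var1
        and ax,Var2
        and ax,Var3
        inc ax              ;0FFFFh + 1 wraps around to 0
        jz  AllMinusOne     ;ZF is set only if every variable is -1
```

Each test needs a single conditional jump, instead of one compare and one jump per variable.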

9. Postlude


Intel makes their excellent CPU documentation available free, from:

http://developer.intel.com/design/litcentr/index.htm

It is in Adobe PDF format; you will need the Acrobat Reader, also free, from:

http://www.adobe.com/prodindex/acrobat/readstep.html

If all else fails, you can try to wake me up by e-mail.
Regards from Vancouver,
Larry