In this article we will examine two 'modern' _strcpy functions, found in
MSVCRT.DLL and Borland C++ Builder library. Those functions are (supposed to be)
optimized for Pentium processors. If you're not familiar with optimization for
Pentium processors, I suggest you read the document on Pentium optimization by
Agner Fog (http://announce.com/agner/assem).
I. INTRODUCTION
C syntax: char *strcpy(char *dest, const char *src);
_strcpy copies string src to dest, stopping after the terminating null character
has been moved, and returns dest.
The 'traditional' way to do this is with the 'rep movs' instruction. BC 4.02
and kernel32 use it. The problem is that it is rather slow (BC _strlen takes
53+5.5n cycles, lstrlenA takes 74+5.5n cycles, and optimizing their code
leads to 46+5.5*n cycles wher n the number of chars, see part I of these
articles). This is because even though the 'rep movs' instruction is fast it
needs to know the number of bytes to copy in advance. So, _strcpy includes a
_strlen function before the actual copying, which is implemented through 'repne
scasb', a slow instruction.
In this article we will examine two 'modern' _strcpy functions, found in
MSVCRT.DLL and Borland C++ Builder library. Those functions are (supposed to be)
optimized for Pentium processors. If you're not familiar with optimization for
Pentium processors, I suggest you read the document on Pentium optimization by
Agner Fog (http://announce.com/agner/assem).
II. STRCPY IN MSVCRT
; c=39+1.75*n / 146 bytes
strcpy proc
push edi
mov edi, [esp+8] ; dest
mov ecx, [esp+0Ch] ; src
test ecx, 3
jz short loop1
algn: mov dl, [ecx]
inc ecx
test dl, dl
jz short one
mov [edi], dl
inc edi
test ecx, 3
jnz short algn
loop1: mov edx, -81010101h
mov eax, [ecx]
add edx, eax
xor eax, -1
xor eax, edx
mov edx, [ecx]
add ecx, 4
test eax, 81010100h
jz short nozero
test dl, dl
jz short one
test dh, dh
jz short two
test edx, 0FF0000h
jz short three
test edx, 0FF000000h
jz short four
nozero: mov [edi], edx
add edi, 4
jmp short loop1
;... in the DLL, there is code here, not used by strcpy
one: mov [edi], dl
mov eax, [esp+8]
pop edi
retn
two: mov [edi], dx
mov eax, [esp+8]
pop edi
retn
three: mov [edi], dx
mov eax, [esp+8]
mov byte ptr [edi+2], 0
pop edi
retn
four: mov [edi], edx
mov eax, [esp+8]
pop edi
retn
strcpy endp
This procedure does the following:
1. Read arguments (src, dest) from stack
2. Check if src is aligned on a 4 byte boundary
If not, copy byte after byte until src gets aligned
3. Loop
Read one dword from src
Test if there is a zero byte in the dword
If no zero, copy dword to dest, loop back
4. Copy the remaining bytes
5. Return with dest in eax
Actually the code above compiles to 130 bytes. The extra 16 bytes are added
because between the loop and the 'one:' label there is the strcat function. So
4 conditional jumps take the 6-byte form, not the 2-byte one.
This function takes 39+1.75*n. This means that the loop takes 7 cycles to
execute (since each time the loop runs, it copies 4 bytes). Here is the explanation
of the loop (U and V refer to the pipe the commands run in).
loop1: mov edx, -81010101h ; U 1st
mov eax, [ecx] ; V
add edx, eax ; U 2nd
xor eax, -1 ; V
xor eax, edx ; U 3rd
mov edx, [ecx] ; V
add ecx, 4 ; U 4th
test eax, 81010100h ; V
jz short nozero ; U 5th
...
nozero: mov [edi], edx ; U 6th
add edi, 4 ; V
jmp short loop1 ; U 7th
The problem here is that both jumps run in the U pipe so they will not pair.
Generally it's better to have an even number of instructions in each block of
code. Just by moving one instruction this code will run in 6 cycles (i.e.
39+1.5*n cycles):
loop1: mov edx, -81010101h ; U 1st
mov eax, [ecx] ; V
add edx, eax ; U 2nd
xor eax, -1 ; V
xor eax, edx ; U 3rd
mov edx, [ecx] ; V
test eax, 81010100h ; U
jz short nozero ; V 4th
...
nozero: mov [edi], edx ; U 5th
add ecx, 4 ; V
|