Statistics

Members: 1927
News: 293
Web Links: 1
Visitors: 3932936

Who's Online

Damn Vulnerable LinuxDamn Vulnerable Linux (DVL) is a Linux-based (modified Damn Small Linux) tool for IT-Security & IT-Anti- Security and Attack & Defense. [CLICK HERE FOR MORE INFOS! ]

Featured Conference Video

T16-Recon2006-Joe_Stewart-OllyBonE.gif OllyBone - Semi-Automatic Unpacking on IA-32. View the conference video here!
Home arrow About/Disclaimer
C string functions: _strcpy
User Rating: / 0
PoorBest 
Written by Xbios2   


In this article we will examine two 'modern' _strcpy functions, found in MSVCRT.DLL and Borland C++ Builder library. Those functions are (supposed to be) optimized for Pentium processors. If you're not familiar with optimization for Pentium processors, I suggest you read the document on Pentium optimization by Agner Fog (http://announce.com/agner/assem).

 

 

I. INTRODUCTION

C syntax: char *strcpy(char *dest, const char *src);

_strcpy copies string src to dest, stopping after the terminating null character has been moved, and returns dest.

The 'traditional' way to do this is with the 'rep movs' instruction. BC 4.02 and kernel32 use it. The problem is that it is rather slow (BC _strlen takes 53+5.5n cycles, lstrlenA takes 74+5.5n cycles, and optimizing their code leads to 46+5.5*n cycles wher n the number of chars, see part I of these articles). This is because even though the 'rep movs' instruction is fast it needs to know the number of bytes to copy in advance. So, _strcpy includes a _strlen function before the actual copying, which is implemented through 'repne scasb', a slow instruction.

In this article we will examine two 'modern' _strcpy functions, found in MSVCRT.DLL and Borland C++ Builder library. Those functions are (supposed to be) optimized for Pentium processors. If you're not familiar with optimization for Pentium processors, I suggest you read the document on Pentium optimization by Agner Fog (http://announce.com/agner/assem).

II. STRCPY IN MSVCRT


; c=39+1.75*n / 146 bytes

strcpy proc

        push    edi
mov     edi, [esp+8]    ; dest
mov     ecx, [esp+0Ch]  ; src
test    ecx, 3
jz      short loop1
algn:   mov     dl, [ecx]
inc     ecx
test    dl, dl
jz      short one
mov     [edi], dl
inc     edi
test    ecx, 3
jnz     short algn
loop1:  mov     edx, -81010101h
mov     eax, [ecx]
add     edx, eax
xor     eax, -1
xor     eax, edx
mov     edx, [ecx]
add     ecx, 4
test    eax, 81010100h
jz      short nozero
test    dl, dl
jz      short one
test    dh, dh
jz      short two
test    edx, 0FF0000h
jz      short three
test    edx, 0FF000000h
jz      short four
nozero: mov     [edi], edx
add     edi, 4
jmp     short loop1
;... in the DLL, there is code here, not used by strcpy
one:    mov     [edi], dl
mov     eax, [esp+8]
pop     edi
retn
two:    mov     [edi], dx
mov     eax, [esp+8]
pop     edi
retn
three:  mov     [edi], dx
mov     eax, [esp+8]
mov     byte ptr [edi+2], 0
pop     edi
retn
four:   mov     [edi], edx
mov     eax, [esp+8]
pop     edi
retn

strcpy endp

This procedure does the following:
1. Read arguments (src, dest) from stack 2. Check if src is aligned on a 4 byte boundary

If not, copy byte after byte until src gets aligned 3. Loop

Read one dword from src
Test if there is a zero byte in the dword If no zero, copy dword to dest, loop back 4. Copy the remaining bytes
5. Return with dest in eax

Actually the code above compiles to 130 bytes. The extra 16 bytes are added because between the loop and the 'one:' label there is the strcat function. So 4 conditional jumps take the 6-byte form, not the 2-byte one.

This function takes 39+1.75*n. This means that the loop takes 7 cycles to execute (since each time the loop runs, it copies 4 bytes). Here is the explanation of the loop (U and V refer to the pipe the commands run in).

loop1:  mov     edx, -81010101h ; U   1st
mov     eax, [ecx]      ;  V
add     edx, eax        ; U   2nd
xor     eax, -1         ;  V
xor     eax, edx        ; U   3rd
mov     edx, [ecx]      ;  V
add     ecx, 4          ; U   4th
test    eax, 81010100h  ;  V
jz      short nozero    ; U   5th
...
nozero: mov     [edi], edx      ; U   6th
add     edi, 4          ;  V
jmp     short loop1     ; U   7th

The problem here is that both jumps run in the U pipe so they will not pair. Generally it's better to have an even number of instructions in each block of code. Just by moving one instruction this code will run in 6 cycles (i.e. 39+1.5*n cycles):

loop1:  mov     edx, -81010101h ; U   1st
mov     eax, [ecx]      ;  V
add     edx, eax        ; U   2nd
xor     eax, -1         ;  V
xor     eax, edx        ; U   3rd
mov     edx, [ecx]      ;  V
test    eax, 81010100h  ; U
jz      short nozero    ;  V  4th
...
nozero: mov     [edi], edx      ; U   5th
add     ecx, 4          ;  V