Statistics

Members: 1927
News: 293
Web Links: 1
Visitors: 3933307

Who's Online

We have 3 guests online
Damn Vulnerable LinuxDamn Vulnerable Linux (DVL) is a Linux-based (modified Damn Small Linux) tool for IT-Security & IT-Anti- Security and Attack & Defense. [CLICK HERE FOR MORE INFOS! ]

Featured Conference Video

T16-Recon2006-Joe_Stewart-OllyBonE.gif OllyBone - Semi-Automatic Unpacking on IA-32. View the conference video here!
Home arrow Articles - Programming arrow Assembly arrow The _strtok function
The _strtok function
User Rating: / 0
PoorBest 
Written by Xbios2   


I. INTRODUCTION

C syntax: char *strtok(char *s1, const char *s2);

Description

strtok considers the string s1 to consist of a sequence of zero or more text tokens, separated by spans of one or more characters from the separator string s2.
The first call to strtok returns a pointer to the first character of the first token in s1 and writes a null character into s1 immediately following the returned token. Subsequent calls with null for the first argument will work through the string s1 in this way, until no tokens remain. The separator string, s2, can be different from call to call.

Comparing different strtok functions requires a much different approach than strlen and strcpy. This is because the overall speed is not related only to n, the length of the main string, but on many other things, such as the total number of tokens, the number of characters between the tokens, the length of the delimiter string, etc...
But this is better seen in practice:

II. _STRTOK IN BC402

This is the disassembly of the Borland C++ 4.02 library. You can read it as an example of non-optimized code, otherwise you may just skip and read the explanation.

_strtok proc near

v_retval        = dword ptr -8
v_pointer       = dword ptr -4
a_s1            = dword ptr  8
a_s2            = dword ptr  0Ch
enter   8, 0
push    edi
push    esi
mov     edi, [ebp+a_s1]
call    __thread_data
add     eax, 18h
test    edi, edi
mov     [ebp+v_pointer], eax
jnz     short first_time
mov     eax, [ebp+v_pointer]
mov     edi, [eax]
first_time:     jmp     short enterPhase1
; --------------------------------------------
loopPhase1:     mov     esi, [ebp+a_s2]
jmp     short enterScan1
; --------------------------------------------
loopScan1:      mov     al, [esi]
cmp     al, [edi]
jz      short endScan1
inc     esi
enterScan1:     mov     al, [esi]
test    al, al
jnz     short loopScan1
endScan1:       mov     al, [esi]
test    al, al
jz      short exitPhase1
inc     edi
enterPhase1:    mov     al, [edi]
test    al, al
jnz     short loopPhase1
exitPhase1:     mov     al, [edi]
test    al, al
jnz     short phase2
mov     eax, [ebp+v_pointer]
mov     [eax], edi
xor     eax, eax
jmp     short return
; --------------------------------------------
phase2:         mov     [ebp+v_retval], edi
jmp     short enterPhase2
; --------------------------------------------
loopPhase2:     mov     esi, [ebp+a_s2]
jmp     short enterScan2
; --------------------------------------------
loopScan2:      mov     al, [esi]
cmp     al, [edi]
jnz     short nextScan2
mov     byte ptr [edi], 0
inc     edi
mov     eax, [ebp+v_pointer]
mov     [eax], edi
mov     eax, [ebp+v_retval]
jmp     short return
; --------------------------------------------
nextScan2:      inc     esi
enterScan2:     mov     al, [esi]
test    al, al
jnz     short loopScan2
inc     edi
enterPhase2:    mov     al, [edi]
test    al, al
jnz     short loopPhase2
mov     eax, [ebp+v_pointer]
mov     [eax], edi
mov     eax, [ebp+v_retval]
return:         pop     esi
pop     edi
leave
retn
_strtok         endp
Explanation
First af all, s1 is checked. If it is null, the pointer to the string is restored from where it was saved on the previous call. (One value is saved for each thread, not for each process, so it can't be stored in the .data segment).

Then we enter the first loop, Phase1. Here characters are read until one is found that does NOT belong to s2 (then jump to Phase2) or the terminating NULL is reached (then just return NULL). Every character read from s1 is compared against all characters in s2 (in the Scan1 loop).

The second loop, Phase2 (if we reach it), reads characters from s1 until one is found that belongs to s2 (then replace it with a NULL, return) or the terminating NULL is reached (then just return).

Actually before entering Phase2, the pointer to the current char in s1 is saved, and it is returned in EAX as the return value.

Notice that the string s1 is modified by strtok.

III. ESTIMATING PERFORMANCE

As noted in the introduction, the speed of _strtok can't be calculated in a 'k+l*n' way. So a more general 'speed' is estimated:

The above strtok function is really slow for three reasons: 1. '_thread_data' calls various other functions, slowing code down.

(actually, this does not exist in the single-thread library) 2. The code is not pentium-optimized, not optimized at all I'd say. 3. The algorithm used is REALLY bad.

Reason 3 is the most important. Even the fastest code written to use this algorithm would be slow. The reason is that there are two NESTED loops, one to find the first character of the token and one to find the first delimiter. Supposing that s1 does not start with a delimiter, then to get the first token, each character of the token is compared to EVERY character of s2. This means that if the first token of s1 is 10 characters long, and s2 contains 10 delimiter characters, at least 1010=100 loop cycles are needed. Try duplicating s2, to be 20 characters long. Without adding information, 1020=200 loop cycles are needed now.

IV. THE LOOKUP TABLE

To get rid of the nested loops, the function must have a 'direct' way of knowing if a character (read from s1) exists in s2. This is accomplished through a lookup table, i.e. a region of memory holding, for each of the 256 different characters, a value that indicates if that character is a delimiter. So the function will consist of one loop scanning s2 and setting values in the lookup table, and another one scanning s1 and comparing it's values against the lookup table.

Some considerations regarding the lookup table must be made: - First of all, forget Unicode. No one would like a lookup table with 65536 entries...
- The values in the table can be either bits or bytes. It's easier and faster to access the values in a byte-sized table, but it takes more time to reset the lookup table each time the function is called - The lookup table can be either local (stack) or global (.data segment). The 32 bytes needed create no problems in either place. It is faster to have a local table (for various reasons), and it's also more reasonable. Yet a global lookup table allows for an interesting extension to _strtok:

It is common to use the same s2 string many times. By using a global table, which is preserved across calls, the caller may pass NULL as s2, to reuse the last lookup table. This way a lookup table is calculated only once. For a more specific application, where speed is essential, but instead of only one delimiter string, there are more, the function can be improved to receive, instead of s2, a pointer to a lookup table, either built in or created by another function

In this essay we'll stick to the 'normal' strtok, using a local table. This is also the way _strtok is implemented in MSVCRT (with bit values). See for yourselves:

V. STRTOK IN MSVCRT

strtok proc near

lookup = byte ptr -20h
a_s1 = dword ptr 4
a_s2 = dword ptr 8

        sub     esp, 20h
push    ebx
push    ebp
mov     ebp, [esp+28h+a_s2]     ; EBP points at s2
push    esi
push    edi
call    GetTls
mov     edx, eax
mov     ecx, 8
xor     eax, eax
lea     edi, [esp+30h+lookup]
mov     [esp+30h+a_s2], edx     ; save in a_s2 the value by GetTls
repe stosd                      ; reset lookup table

; Loop1 : scan s2 and set lookup table values

loop1:  mov     al, [ebp+0]
mov     bl, 1
mov     ecx, eax
and     ecx, 0FFh
mov     esi, ecx
and     ecx, 7
shr     esi, 3
shl     bl, cl
mov     cl, [esp+esi+30h+lookup]
lea     esi, [esp+esi+30h+lookup]
or      cl, bl
inc     ebp
test    al, al
mov     [esi], cl
jnz     short loop1
mov     esi, [esp+30h+a_s1]     ; ESI points at s1
test    esi, esi
jnz     short skip
mov     esi, [edx+18h]          ; s1 NULL, restore previous
skip:   mov     dl, [esi]
mov     eax, 1
mov     edi, edx
and     edi, 0FFh
mov     ecx, edi
and     ecx, 7
shl     eax, cl
shr     edi, 3
mov     cl, [esp+edi+30h+lookup]
test    cl, al
jz      short exit2

;
; Loop2 : find first non delimiter, or NULL

loop2: test dl, dl

        jz      short exit2
mov     dl, [esi+1]
inc     esi
mov     eax, edx
mov     ebx, 1
and     eax, 0FFh
mov     ecx, eax
and     ecx, 7
shl     ebx, cl
shr     eax, 3
mov     al, [esp+eax+30h+lookup]
test    al, bl
jnz     short loop2
exit2:  mov     al, [esi]
mov     edi, esi
test    al, al
jnz     short enter3
jmp     short exit3             ; NULL found, return NULL

; ---------------------------------------------------------------------

; Loop3 : Find delimiter marking the end of the token

loop3:  mov     al, [esi+1]
inc     esi
test    al, al
jz      short exit3
enter3: and     eax, 0FFh
mov     edx, 1
mov     ecx, eax
and     ecx, 7
shl     edx, cl
shr     eax, 3
mov     al, [esp+eax+30h+lookup]
test    al, dl
jz      short loop3
mov     byte ptr [esi], 0       ; mark the end of token
inc     esi                     ; point s1 to next char
exit3:  mov     ecx, [esp+30h+a_s2]     ; value by GetTls
mov     eax, edi
sub     eax, esi
neg     eax
sbb     eax, eax
mov     [ecx+18h], esi          ; store pointer for next call
and     eax, edi
pop     edi
pop     esi
pop     ebp
pop     ebx
add     esp, 20h
retn

strtok endp

GetTls is a function in MSVCRT.DLL that uses GetTlsValue to get thread storage space. The layout is rather easy to understand, the 'bit-fiddling' to access the values in the lookup table is a bit complicated. But it's rather good code. Yet it can be optimized even more at a low level.

VI. FINAL VERSIONS

This is the final version using bit values in the lookup table:

.data
masks db 1,2,4,8,16,32,64,128

.code
_strtok proc

        push    edi
push    ebx
xor     eax, eax
mov     edi, [esp+16]   ; s2 [delimiters]

rept 7

push eax ; reset table endm

push 1 ; set NULL as a separator

        mov     al, [edi]
inc     edi
or      al, al
jz      skip1
mov     ecx, eax
mov     ebx, eax
loop1:  shr     ebx, 3
and     ecx, 7
mov     al, [edi]
inc     edi
mov     dl, masks[ecx]
mov     ah, [esp+ebx]
mov     cl, al
or      dl, ah
mov     [esp+ebx], dl
mov     bl, al
or      al, al
jnz     loop1
skip1:  mov     edi, [esp+44]   ; s1 [string]
xor     eax, eax
test    edi, edi
jnz     short loop2
mov     edi, [pointer]
loop2:  mov     cl, [edi]
inc     edi
mov     ebx, ecx
and     ecx, 7
or      bl, bl
jz      short nomore    ; no more tokens, return NULL
shr     ebx, 3
mov     cl, masks[ecx]
test    [esp+ebx], cl
jnz     short loop2
lea     eax, [edi-1]    ; this is the return value
loop3:  mov     cl, [edi]
mov     ebx, ecx
and     ecx, 7
shr     ebx, 3
inc     edi
mov     cl, masks[ecx]
test    [esp+ebx], cl
jz      short loop3
mov     cl, [edi-1]
return: mov     byte ptr [edi-1], 0
nomore: cmp     cl, 1
sbb     edi, 0
mov     [pointer], edi
add     esp, 20h
pop     ebx
pop     edi
retn

_strtok endp

Again, the general layout is rather simple (same as in MSVCRT) . The bit fiddling is even 'stranger', and the instruction order is a bit weird, in order to achieve optimum pairing. But it runs at about more than twice the speed msvcrt does.

The main features of this version are:
-Space in the stack is allocated by PUSHing 8 times (PUSH pairs with itself, so it only takes 4 cycles to allocate and clear the table) -8 bytes in the .data segment contain the masks used to acces bits 0 to 7 in a byte, thus allowing faster bit-value access. -It works, and it works FAST.

A little bit faster (not always, though, since memory access is more intense) is this version with byte values in the lookup table:

_strtok proc

        xor     ecx, ecx
mov     edx, [esp+8]    ; s2 [delimiters]

rept 63

push ecx ; reset table endm

push 1 ; set NULL as a separator

        mov     cl, [edx]
inc     edx
or      cl, cl
jz      skip1
loop1:  mov     byte ptr [esp+ecx], 1
mov     cl, [edx]
inc     edx
or      cl, cl
jnz     loop1
skip1:  mov     edx, [esp+4+256] ; s1 [string]
xor     eax, eax
test    edx, edx
jnz     short loop2
mov     edx, [pointer]
loop2:  mov     cl, [edx]
inc     edx
or      cl, cl
jz      short nomore    ; no more tokens, return NULL
cmp     byte ptr [esp+ecx], 1
je      short loop2
lea     eax, [edx-1]    ; this is the return value
nop
loop3:  mov     cl, [edx]
inc     edx
cmp     byte ptr [esp+ecx], 1
jne     short loop3
return: mov     byte ptr [edx-1], 0
nomore: cmp     cl, 1
sbb     edx, 0
mov     [pointer], edx
add     esp, 256
retn

_strtok endp