I. INTRODUCTION
C syntax: char *strtok(char *s1, const char *s2);
Description
strtok considers the string s1 to consist of a sequence of zero or more text
tokens, separated by spans of one or more characters from the separator string
s2.
The first call to strtok returns a pointer to the first character of the first
token in s1 and writes a null character into s1 immediately following the
returned token. Subsequent calls with null for the first argument will work
through the string s1 in this way, until no tokens remain.
The separator string, s2, can be different from call to call.
Comparing different strtok functions requires a much different approach than
strlen and strcpy. This is because the overall speed is not related only to n,
the length of the main string, but on many other things, such as the total
number of tokens, the number of characters between the tokens, the length of the
delimiter string, etc...
But this is better seen in practice:
II. _STRTOK IN BC402
This is the disassembly of the Borland C++ 4.02 library. You can read it as an
example of non-optimized code, otherwise you may just skip and read the
explanation.
_strtok proc near
v_retval = dword ptr -8
v_pointer = dword ptr -4
a_s1 = dword ptr 8
a_s2 = dword ptr 0Ch
enter 8, 0
push edi
push esi
mov edi, [ebp+a_s1]
call __thread_data
add eax, 18h
test edi, edi
mov [ebp+v_pointer], eax
jnz short first_time
mov eax, [ebp+v_pointer]
mov edi, [eax]
first_time: jmp short enterPhase1
; --------------------------------------------
loopPhase1: mov esi, [ebp+a_s2]
jmp short enterScan1
; --------------------------------------------
loopScan1: mov al, [esi]
cmp al, [edi]
jz short endScan1
inc esi
enterScan1: mov al, [esi]
test al, al
jnz short loopScan1
endScan1: mov al, [esi]
test al, al
jz short exitPhase1
inc edi
enterPhase1: mov al, [edi]
test al, al
jnz short loopPhase1
exitPhase1: mov al, [edi]
test al, al
jnz short phase2
mov eax, [ebp+v_pointer]
mov [eax], edi
xor eax, eax
jmp short return
; --------------------------------------------
phase2: mov [ebp+v_retval], edi
jmp short enterPhase2
; --------------------------------------------
loopPhase2: mov esi, [ebp+a_s2]
jmp short enterScan2
; --------------------------------------------
loopScan2: mov al, [esi]
cmp al, [edi]
jnz short nextScan2
mov byte ptr [edi], 0
inc edi
mov eax, [ebp+v_pointer]
mov [eax], edi
mov eax, [ebp+v_retval]
jmp short return
; --------------------------------------------
nextScan2: inc esi
enterScan2: mov al, [esi]
test al, al
jnz short loopScan2
inc edi
enterPhase2: mov al, [edi]
test al, al
jnz short loopPhase2
mov eax, [ebp+v_pointer]
mov [eax], edi
mov eax, [ebp+v_retval]
return: pop esi
pop edi
leave
retn
_strtok endp
- Explanation
- First af all, s1 is checked. If it is null, the pointer to the string is
restored from where it was saved on the previous call. (One value is saved for
each thread, not for each process, so it can't be stored in the .data segment).
Then we enter the first loop, Phase1. Here characters are read until one is
found that does NOT belong to s2 (then jump to Phase2) or the terminating NULL
is reached (then just return NULL). Every character read from s1 is compared
against all characters in s2 (in the Scan1 loop).
The second loop, Phase2 (if we reach it), reads characters from s1 until one is
found that belongs to s2 (then replace it with a NULL, return) or the
terminating NULL is reached (then just return).
Actually before entering Phase2, the pointer to the current char in s1 is saved,
and it is returned in EAX as the return value.
Notice that the string s1 is modified by strtok.
III. ESTIMATING PERFORMANCE
As noted in the introduction, the speed of _strtok can't be calculated in a
'k+l*n' way. So a more general 'speed' is estimated:
The above strtok function is really slow for three reasons:
1. '_thread_data' calls various other functions, slowing code down.
(actually, this does not exist in the single-thread library)
2. The code is not pentium-optimized, not optimized at all I'd say.
3. The algorithm used is REALLY bad.
Reason 3 is the most important. Even the fastest code written to use this
algorithm would be slow. The reason is that there are two NESTED loops, one to
find the first character of the token and one to find the first delimiter.
Supposing that s1 does not start with a delimiter, then to get the first token,
each character of the token is compared to EVERY character of s2. This means
that if the first token of s1 is 10 characters long, and s2 contains 10
delimiter characters, at least 1010=100 loop cycles are needed. Try duplicating
s2, to be 20 characters long. Without adding information, 1020=200 loop cycles
are needed now.
IV. THE LOOKUP TABLE
To get rid of the nested loops, the function must have a 'direct' way of knowing
if a character (read from s1) exists in s2. This is accomplished through a
lookup table, i.e. a region of memory holding, for each of the 256 different
characters, a value that indicates if that character is a delimiter. So the
function will consist of one loop scanning s2 and setting values in the lookup
table, and another one scanning s1 and comparing it's values against the lookup
table.
Some considerations regarding the lookup table must be made:
- First of all, forget Unicode. No one would like a lookup table with 65536
entries...
- The values in the table can be either bits or bytes. It's easier and faster to
access the values in a byte-sized table, but it takes more time to reset the
lookup table each time the function is called
- The lookup table can be either local (stack) or global (.data segment). The 32
bytes needed create no problems in either place. It is faster to have a local
table (for various reasons), and it's also more reasonable. Yet a global lookup
table allows for an interesting extension to _strtok:
It is common to use the same s2 string many times. By using a global table,
which is preserved across calls, the caller may pass NULL as s2, to reuse
the last lookup table. This way a lookup table is calculated only once.
For a more specific application, where speed is essential, but instead of
only one delimiter string, there are more, the function can be improved to
receive, instead of s2, a pointer to a lookup table, either built in or
created by another function
In this essay we'll stick to the 'normal' strtok, using a local table.
This is also the way _strtok is implemented in MSVCRT (with bit values). See for
yourselves:
V. STRTOK IN MSVCRT
strtok proc near
lookup = byte ptr -20h
a_s1 = dword ptr 4
a_s2 = dword ptr 8
sub esp, 20h
push ebx
push ebp
mov ebp, [esp+28h+a_s2] ; EBP points at s2
push esi
push edi
call GetTls
mov edx, eax
mov ecx, 8
xor eax, eax
lea edi, [esp+30h+lookup]
mov [esp+30h+a_s2], edx ; save in a_s2 the value by GetTls
repe stosd ; reset lookup table
; Loop1 : scan s2 and set lookup table values
loop1: mov al, [ebp+0]
mov bl, 1
mov ecx, eax
and ecx, 0FFh
mov esi, ecx
and ecx, 7
shr esi, 3
shl bl, cl
mov cl, [esp+esi+30h+lookup]
lea esi, [esp+esi+30h+lookup]
or cl, bl
inc ebp
test al, al
mov [esi], cl
jnz short loop1
mov esi, [esp+30h+a_s1] ; ESI points at s1
test esi, esi
jnz short skip
mov esi, [edx+18h] ; s1 NULL, restore previous
skip: mov dl, [esi]
mov eax, 1
mov edi, edx
and edi, 0FFh
mov ecx, edi
and ecx, 7
shl eax, cl
shr edi, 3
mov cl, [esp+edi+30h+lookup]
test cl, al
jz short exit2
;
; Loop2 : find first non delimiter, or NULL
loop2: test dl, dl
jz short exit2
mov dl, [esi+1]
inc esi
mov eax, edx
mov ebx, 1
and eax, 0FFh
mov ecx, eax
and ecx, 7
shl ebx, cl
shr eax, 3
mov al, [esp+eax+30h+lookup]
test al, bl
jnz short loop2
exit2: mov al, [esi]
mov edi, esi
test al, al
jnz short enter3
jmp short exit3 ; NULL found, return NULL
; ---------------------------------------------------------------------
; Loop3 : Find delimiter marking the end of the token
loop3: mov al, [esi+1]
inc esi
test al, al
jz short exit3
enter3: and eax, 0FFh
mov edx, 1
mov ecx, eax
and ecx, 7
shl edx, cl
shr eax, 3
mov al, [esp+eax+30h+lookup]
test al, dl
jz short loop3
mov byte ptr [esi], 0 ; mark the end of token
inc esi ; point s1 to next char
exit3: mov ecx, [esp+30h+a_s2] ; value by GetTls
mov eax, edi
sub eax, esi
neg eax
sbb eax, eax
mov [ecx+18h], esi ; store pointer for next call
and eax, edi
pop edi
pop esi
pop ebp
pop ebx
add esp, 20h
retn
strtok endp
GetTls is a function in MSVCRT.DLL that uses GetTlsValue to get thread storage
space. The layout is rather easy to understand, the 'bit-fiddling' to access the
values in the lookup table is a bit complicated. But it's rather good code. Yet
it can be optimized even more at a low level.
VI. FINAL VERSIONS
This is the final version using bit values in the lookup table:
.data
masks db 1,2,4,8,16,32,64,128
.code
_strtok proc
push edi
push ebx
xor eax, eax
mov edi, [esp+16] ; s2 [delimiters]
rept 7
push eax ; reset table
endm
push 1 ; set NULL as a separator
mov al, [edi]
inc edi
or al, al
jz skip1
mov ecx, eax
mov ebx, eax
loop1: shr ebx, 3
and ecx, 7
mov al, [edi]
inc edi
mov dl, masks[ecx]
mov ah, [esp+ebx]
mov cl, al
or dl, ah
mov [esp+ebx], dl
mov bl, al
or al, al
jnz loop1
skip1: mov edi, [esp+44] ; s1 [string]
xor eax, eax
test edi, edi
jnz short loop2
mov edi, [pointer]
loop2: mov cl, [edi]
inc edi
mov ebx, ecx
and ecx, 7
or bl, bl
jz short nomore ; no more tokens, return NULL
shr ebx, 3
mov cl, masks[ecx]
test [esp+ebx], cl
jnz short loop2
lea eax, [edi-1] ; this is the return value
loop3: mov cl, [edi]
mov ebx, ecx
and ecx, 7
shr ebx, 3
inc edi
mov cl, masks[ecx]
test [esp+ebx], cl
jz short loop3
mov cl, [edi-1]
return: mov byte ptr [edi-1], 0
nomore: cmp cl, 1
sbb edi, 0
mov [pointer], edi
add esp, 20h
pop ebx
pop edi
retn
_strtok endp
Again, the general layout is rather simple (same as in MSVCRT) . The bit
fiddling is even 'stranger', and the instruction order is a bit weird, in order
to achieve optimum pairing. But it runs at about more than twice the speed
msvcrt does.
The main features of this version are:
-Space in the stack is allocated by PUSHing 8 times (PUSH pairs with itself, so
it only takes 4 cycles to allocate and clear the table)
-8 bytes in the .data segment contain the masks used to acces bits 0 to 7 in a
byte, thus allowing faster bit-value access.
-It works, and it works FAST.
A little bit faster (not always, though, since memory access is more intense) is
this version with byte values in the lookup table:
_strtok proc
xor ecx, ecx
mov edx, [esp+8] ; s2 [delimiters]
rept 63
push ecx ; reset table
endm
push 1 ; set NULL as a separator
mov cl, [edx]
inc edx
or cl, cl
jz skip1
loop1: mov byte ptr [esp+ecx], 1
mov cl, [edx]
inc edx
or cl, cl
jnz loop1
skip1: mov edx, [esp+4+256] ; s1 [string]
xor eax, eax
test edx, edx
jnz short loop2
mov edx, [pointer]
loop2: mov cl, [edx]
inc edx
or cl, cl
jz short nomore ; no more tokens, return NULL
cmp byte ptr [esp+ecx], 1
je short loop2
lea eax, [edx-1] ; this is the return value
nop
loop3: mov cl, [edx]
inc edx
cmp byte ptr [esp+ecx], 1
jne short loop3
return: mov byte ptr [edx-1], 0
nomore: cmp cl, 1
sbb edx, 0
mov [pointer], edx
add esp, 256
retn
_strtok endp
|