Statistics

Members: 1927
News: 293
Web Links: 1
Visitors: 3962721

Who's Online

We have 2 guests online
Damn Vulnerable LinuxDamn Vulnerable Linux (DVL) is a Linux-based (modified Damn Small Linux) tool for IT-Security & IT-Anti- Security and Attack & Defense. [CLICK HERE FOR MORE INFOS! ]

Featured Conference Video

T16-Recon2006-Joe_Stewart-OllyBonE.gif OllyBone - Semi-Automatic Unpacking on IA-32. View the conference video here!
Splitting Strings
User Rating: / 0
PoorBest 
Written by Mammon   


Those familiar with Perl will undoubtedly have used its split() function, which takes a single string and splits it into multiple strings or into an array, based on a delimiter character specified in the call. Typical invocations of split() would be:

     ($field1, $field2, $junk) = split(':', $line);
@array = split(' ', $line);

In the first line, the source string is split into a maximum of 3 substrings, creating a new string each time it encounters a colon character; note that the third string, $junk, contains the entire rest of the string -- only the first 2 colons will be parsed. In the second line, an array of strings is created by splitting the source string at the space character; since the number of destination strings is not specified, the array will contain one element for each substring [read: each string created by splitting the original at a whitespace character].

Strings and string parsing are notably tedious in assembly. Once learning Perl, I found that the pseudocode for many of my asm programs started to include a few calls to 'split', since it is a handy one-line method of string parsing, applicable to processing command lines, user input, and data files. As a result, it quickly became necessary to write such a routine.

Being that asm has no inherent array or string tokenizing support, there are many possible approaches to string splitting. Since the most immediate problem is that the split() routine does not know in advance how many substrings it will be creating, there is a temptation to code a strtok() replacement, such that the first call returns the first substring, and subsequent calls each return the next substring until the end of the string has been reached:

                  mov ecx, ptrArray
push dword ptrString
push dword [delimiter]
call split
mov [ecx], eax
.loop:
call split
cmp eax, 0
je .end
mov [ecx], eax
add ecx, 4
jmp .loop

.end:

This allows for control over the number of substrings created by only calling split() the desired number of times; however this method also requires a lot of caller-side work --setting up an array, moving the string pointer returned in eax to an appropriate array position, and keeping track of the number of array elements. It is also noticeably more clumsy than the Perl version.

Another method would be to mimic the Perl function entirely, and have split() return an array of substrings:

                  push dword ptrString
push dword [delimiter]
call split
mov [ptrStringArray], eax

This is obviously more elegant on the caller side, but it has a few subtle problems: first, the control over how many elements is split is lost; secondly, the array is of indefinite element size [i.e., one would have to scan each string again in order to find the end and thus the next string]; and lastly, the duplication of the string in memory is somewhat of a waste.

The C language has more or less created a string standard in which strings are terminated with a null ['\0' or 0x0] character. Most library or OS functions to which the split strings will be passed tend to expect this termination; thus each substring is going to have a termination byte added. However, this termination byte can replace the delimiter for each substring, thus allowing the original string itself to serve as the array of substrings after the split function. Thus, all that is required from the split function is to return an array of dword pointers into the original string, and a count of the array elements [substrings]:

                  push dword ptrString
push dword [delimiter]
call split
mov [ptrStringArray], eax
mov [StringArrayNum], ebx

The split function will have to create a DWORD element for each substring it splits; while this is somewhat wasteful, it is still less expensive than copying the entire string a second time, unless the string is composed of 1-3 byte substrings. In order to control the number of splits, a 'max_split' parameter will have to be added to the split() routine, such that if max_split is NULL, the split() routine will return the maximum possible number of substrings; if max_split is non-NULL, split() will return max_split or fewer substrings.

The complete split routine is as follows:

#--------------------------------------------------------------------split.asm ; split( char, string, max_split)

;     Returns address of array of pointers into original string in eax
;     Returns number of array elements in ebx
;     Behavior:
;           split( ":", "this:that:theother:null\0", NULL)
;           "this\0that\0theother\0null\0"
;           ptrArray[0] = [ptrArray+0] = "this\0"
;           ptrArray[1] = [ptrArray+4] = "that\0"
;           ptrArray[2] = [ptrArray+8] = "theother\0"
;           ptrArray[3] = [ptrArray+C] = "null\0"

EXTERN malloc
EXTERN free

split

push ebp mov ebp, esp ;save stack pointer mov ecx, [ebp + 8] ;max# of splits mov edi, [ebp + 12] ;pointer to target string mov ebx, [ebp + 16] ;splitchar

        xor eax, eax                            ;zero out eax for later
mov edx, esp                            ;save current stack pos.
push dword edi                          ;save ptr to first substring
cmp ecx, 0                                      ;is #splits NULL?
jnz do_split            ;--no, start splitting
mov ecx, 0xFFFF                 ;--yes, set to MAX

do_split:

        mov bh, byte [edi]      ;get byte from target string
cmp bl, bh                                      ;equal to delimiter?
je .splitstr            ;--yes, then split it
cmp al, bh                                      ;end of string? [al == 0x0]
je EOS                  ;--yes, then leave split()
inc edi                                         ;next char
loop do_split 
.splitstr:
mov [edi], byte al         ;replace split delimiter with "\0"
inc edi                                         ;move to first char after delimiter
push edi                                                ;save ptr to next substring
loop do_split                           ;loop #splits or till EOS
EOS

mov ecx, edx ;edx, ecx == original stack position sub ecx, esp ;get total size of pushed pointers push ecx ;save size call malloc ;allocate that much space for array test eax, eax jz .error pop ecx ;restore size mov edi, eax ;set destination to beginning of array add edi, ecx ;move to end of array shr ecx, 2 ;divide total size/4 [= # of dwords to move] mov ebx, ecx ;save count

.store:

        sub edi, 4                                      ;move to beginning of dword
pop dword [edi]                         ;pop from stack to array
loop .store

.error:

        mov esp, ebp
pop ebp
ret                                                     ;eax = array[0], ebx = array count

#------------------------------------------------------------------------EOF

The use of the stack in this routine may be a little unclear. Each time a delimiter is encountered, the a pointer to the character after the delimiter is pushed onto the stack:

                  this:that:theother\0
^----------------------This is pushed at the very beginning.
Element: array[0]
this:that:theother\0
^-----------------This is pushed when the first ':' is found.
Element: array[1]
this\0that:theother\0
^-----------This is pushed when the second ':' is found
Element#: array[2]
this\0that\0theother\0
The stack now looks like this:
--------------[ebp]
ptr->string1
ptr->string2
ptr->string3
--------------[esp]
The string pointers are then POPed into the 
array, starting with array[2] and ending with
array[0].

Once the string is parsed and the pointers are PUSHed to the stack, edi is set to the address of the array [mov edi, eax] and advanced to the end of the allocated array [add edi, ecx]. The counter is then set to the number of DWORD pointers that have been pushed onto the stack [shr ecx, 2]; for each DWORD pointer, edi is withdrawn 4 bytes more from the end of the array [sub edi, 4] and the pointer is POPed into that 4 byte space. In the last iteration of the loop, edi is set to the beginning of the allocated array, and the first DWORD pointer [ array[0] ] is POPed into the first array element.

To test this, of course, one needs a program to drive it. The following code simulates an /etc/passwd read, splitting a hard-coded line into its component, colon-delimited fields:

#----------------------------------------------------------------splittest.asm BITS 32
GLOBAL main
EXTERN printf
EXTERN free
EXTERN exit
%include 'split.asm'

SECTION .text
main:

        push dword szString             ;print the original string
push dword szOutput
call printf
add esp, 8
push dword ":"                          ;split the original string
push dword szString
push dword 0
call split
add esp, 12
mov ecx, ebx
mov ebx, eax
printarray:                                             ;print the substrings
push ecx                                                ;printf hoses ecx!!!!!
push dword [ds:ebx]
push dword szOutput
call printf
add esp, 8
add ebx, 4                                      ;skip to next array element
pop ecx
loop printarray
push dword [ptrarray]   ;free the array created by split
call free
add esp, 4
push dword 0                            ;program is done
call exit

SECTION .data

szOutput        db '%s',0Ah,0Dh,0                                                                       ;printf format string
szString        db      'name:password:UID:GID:group:home',0    ;string to print

#------------------------------------------------------------------------EOF

This program was written using nasm on a glibc Linux platform; however the split routine itself is fairly portable --the only assumed external routine is malloc() and -- and can easily be rewritten for the DOS or win32 platforms.