Statistics

Members: 1927
News: 293
Web Links: 1
Visitors: 3929567

Who's Online

We have 1 guest online
Damn Vulnerable LinuxDamn Vulnerable Linux (DVL) is a Linux-based (modified Damn Small Linux) tool for IT-Security & IT-Anti- Security and Attack & Defense. [CLICK HERE FOR MORE INFOS! ]

Featured Conference Video

T16-Recon2006-Joe_Stewart-OllyBonE.gif OllyBone - Semi-Automatic Unpacking on IA-32. View the conference video here!
Home arrow Submit Your Paper!
Challenge: Code a fast pattern matching algorithm.
User Rating: / 0
PoorBest 
Written by Steve Hutchesson   


Four approaches are presented here, three by Steve Hutchesson, who also wrote a very good introductory text explaining the foundation of the Boyer Moore search algorithm and its variations, and one by buliaNaza who aims at writing the fastest binary string search algorithm for PPlain and PMMX processors.

 

 

Solution

Four approaches are presented here, three by Steve Hutchesson, who also wrote a very good introductory text explaining the foundation of the Boyer Moore search algorithm and its variations, and one by buliaNaza who aims at writing the fastest binary string search algorithm for PPlain and PMMX processors.

                            Three Boyer Moore Exact Pattern Matching Algorithms
by Steve Hutchesson

Three Boyer Moore Exact Pattern Matching Algorithms


Steve Hutchesson
Sydney
Australia
August 2001
{ This e-mail address is being protected from spam bots, you need JavaScript enabled to view it }

In 1977 Robert Boyer and L. Moore designed an exact pattern matching algorithm that was different from any of the contemporary designs of the time. It had a fundamentally different logic that compared the pattern being searched for to the current location in the source in reverse order.

The logic was based on obtaining more information from performing the comparison in reverse than the standard methods of forward comparison. If a character that caused the mismatch was not among the characters that were in the pattern being matched, there was no point in matching any further characters so the pattern could be shifted right by the number of characters needed to go past it.

This shift has usually been called the BAD CHARACTER shift.

|
source : bad character shift
pattern : shift

|

Character "t" mismatches with character "c" in the source. "c" is not in the pattern being searched for and there is no point in searching further back as no match is possible at the current location so the pattern is shifted the number of places right so that the pattern is completely past the mismatching character.

|
source : bad character shift

pattern :      shift
|

Character "t" again mismatches with character "c" in the source so the pattern is again shifted completely past the mismatching character.

|
source : bad character shift

pattern :           shift
|

The next mismatch is different to the previous ones, it is with a character that is within the pattern being searched for and this requires a different type of shift. When a character is within the pattern, it allows the capacity to start matching the pattern to the source. This shift is usually called the GOOD SUFFIX shift but it is sometimes called the MATCHING SHIFT.

The fundamental Boyer Moore design uses a clever method of determining if the character being compared is within the pattern being searched for or not. It constructs a table of 256 members which is initially filled with the length of the pattern being searched for in the source. It then overwrites the position of each character in the pattern into the table at the correct position for the character's ascii value.

This means that a character being compared can be tested in one memory read to determine if it is within the pattern or not, if the shift in the table is the same length as the pattern, the character is not in the pattern, if it is less, it is a character that is in the pattern.

This will produce a set of shifts for the character in the pattern that descend in their value.

pattern : shift

          4321      <- GOOD SUFFIX shift
12345     <- BAD CHARACTER shift

The method of calculating the BAD CHARACTER shift is based on the ascending count from the beginning of the pattern. If it is the first character being compared, the shift is the length of the pattern, for each comparison made, the shift decrements by one.

Apply the GOOD SUFFIX shift from the table and the pattern is shifted across so that the character "s" lines up with the "s" in the source and the pattern has been matched.

*
source : bad character shift

pattern :               shift
*

This example works OK because the mismatch occurs on the first comparison but in patterns that have repeat sequences of characters, this matching by itself will often fail to produce a match.

pattern : foooooo

          711111    <- GOOD SUFFIX shift
1234567   <- BAD CHARACTER shift

The sequence of "1" in the GOOD SUFFIX shift is caused by the overwriting of the location for the character "o" in the table for each of its occurrences. The normal method is to subtract the BAD CHARACTER shift from the GOOD SUFFIX shift if the mismatch is not the first at the current location in the source. This can produce a value less than 1 so a minimum shift of 1 is applied if this happens.

Coding Considerations

Much of the available technical data on exact pattern matching is written in ANSI C and it tends to carry the set of assumptions related to the capacity of that language. The "holy grail" of exact pattern matching is to perform as few comparisons as necessary to obtain the match if it exists. This is usually called "sublinearity" and it means comparing less characters that a traditional forward BYTE scanner.

The problem with this approach is that if the overhead to produce the "sublinearity" is too large, the algorithm is slower than a BYTE scanner so considerations of theoretical design must be tempered with what is possible with good coding practice to deliver the desired speed.

The BAD CHARACTER shift has often been coded in high level languages as another table but it is a very inefficient way to code the shift as the loop counter in the main comparison loop holds the same value and it can be accessed a lot faster than a member of a table in memory.

The three version presented below use an Intel specific optimisation related to preventing a register stall by reading and comparing a byte in AL and subsequently using the EAX register in the table location calculation. XOR EAX, EAX or SUB EAX, EAX both zero the register and the stall does not occur. This makes the code slightly slower on AMD hardware but not by very much.

There is an additional heuristic in the original Boyer Moore algorithm that has not been implemented, when a BAD CHARACTER shift has been determined, the heuristic requires that the larger of the two shifts should be applied. In practice the two extra instructions to perform this comparison reduce the speed of the algorithm by about 5%.

Where a GOOD SUFFIX shift is required that is the first mismatch at the current location, the calculation that subtracts the BAD CHARACTER shift is not required so a seperate loop has been included to save this extra two instructions. The speed increase is about 5% for doing so.

Processor Variation

Testing shows that there is measurable differences between later Intel processors and later AMD processors. The AMD has a shorter pipeline and a lower penalty for register stalls where the Intel processors have better branch prediction and a lower penalty for mispredicted jumps. The GOOD SUFFIX shift favours the AMD processors where the BAD CHARACTER shift works better on the Intel processors.

Three variations are implemented that utilise the different shifts, the original BM algorithm uses both shifts, a variation that is similar to a Horspool variation uses only the BAD CHARACTER shift and another variation only uses the GOOD SUFFIX shift.

Algorithm Variations

The original BM algorithm has a slightly higher overhead than the two variations but it generally produces a larger shift and this has the effect that it is more consistent across both processor types with different patterns and different pattern lengths. This is because it it more dependent on logic that fast loop code.

The Horspool variation perfoms well on Intel hardware and is well suited for plain text search in things like text editors and word processors but it is sensitive to patterns that have a high frequency of characters in the source being searched. Its advantage is small loop code in the searching phase. In this implementation, it does the comparison in reverse order as this method produces the BAD CHARACTER shift in the most efficient manner.

The second variation uses only the GOOD SUFFIX shift and generally performs well on older Intel hardware and later AMD machines. It has the advantage of fast loop code but by only using one of the available shifts, its average shift length is shorter than the original algorithm. It uses the same bypass for the first mismatch that the original BM algo has.

Limitations

The pattern length threshold for improving on a forward byte scanner appears to be about 6 characters. Below this a BYTE scanner is faster. A BM type algorithm has about a 300 character penalty in the time it takes to construct the table and this must be kept in mind if the task requires recursively searching short sources for short patterns.

A slightly more subtle consideration is what is called "mismatch recovery". Boyer Moore algorithms have normally been sensitive to the frequency of end characters in the pattern and this is easy to demonstrate when searching plain text when the pattern has a trailing blank space in it. EXAMPLE : "pattern "

The solution is to code the comparison loop with a very short instruction path and while this does not particularly increase the absolute forward scanning speed of the algorithm type, it does improve its recovery from repeated mismatches.

The three algorithms presented below have very good mismatch recovery which is related to their very short comparison loops instruction paths.

The three algorithms have been tested on Intel Celeron, PII and PIII machines and AMD K6-2, Duron and Athlon machines. They have been optimised to run on both types without specifically targetting one particular model. Slight speed increases can be obtained by coding specifically for one particular model but usually at the expense of most other processors.

The parameters for the 3 procedures.

startpos zero based offset to start searching in the source lpSource the address of the source to search srcLngth the length of the source
lpSubStr the address of the pattern to search for subLngth the length of the pattern

                @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@                                     @
@   The basic Boyer Moore algorithm   @
@                                     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

; #########################################################################

.486
.model flat, stdcall ; 32 bit memory model option casemap :none ; case sensitive

.code

; #########################################################################

BMBinSearch proc startpos:DWORD,

                 lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD

LOCAL cval :DWORD
LOCAL shift_table[256]:DWORD

push ebx
push esi
push edi

mov ebx, subLngth

cmp ebx, 1
jg @F
mov eax, -2 ; string too short, must be > 1 jmp Cleanup
@@:

mov esi, lpSource
add esi, srcLngth
sub esi, ebx
mov edx, esi ; set Exit Length

; ---------------------------------------- ; load shift table with value in subLngth ; ---------------------------------------- mov ecx, 256
mov eax, ebx
lea edi, shift_table
rep stosd

; ---------------------------------------------- ; load decending count values into shift table ; ----------------------------------------------

    mov ecx, ebx                ; SubString length in ECX
dec ecx                     ; correct for zero based index
mov esi, lpSubStr           ; address of SubString in ESI

lea edi, shift_table

xor eax, eax

Write Shift Chars
mov al, [esi] ; get the character inc esi mov [edi+eax*4], ecx ; write shift for each character dec ecx ; to ascii location in table jnz Write_Shift_Chars

; -----------------------------
; set up for main compare loop
; -----------------------------
mov ecx, ebx
dec ecx
mov cval, ecx

mov esi, lpSource
mov edi, lpSubStr
add esi, startpos ; add starting position

jmp Pre_Loop

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Calc Suffix Shift
add eax, ecx sub eax, cval ; sub loop count jns Add_Suffix_Shift mov eax, 1 ; minimum shift is 1
Add Suffix Shift
add esi, eax ; add SUFFIX shift mov ecx, cval ; reset counter in compare loop
Test Length
cmp edx, esi ; test exit condition jl No_Match
Pre Loop
xor eax, eax ; zero EAX for following partial writes mov al, [esi+ecx] cmp al, [edi+ecx] ; cmp characters in ESI / EDI je @F mov eax, shift_table[eax*4] cmp ebx, eax jne Add_Suffix_Shift ; bypass SUFFIX calculations lea esi, [esi+ecx+1] ; add BAD CHAR shift jmp Test_Length @@: dec ecx xor eax, eax ; zero EAX for following partial writes
Cmp Loop
mov al, [esi+ecx] cmp al, [edi+ecx] ; cmp characters in ESI / EDI jne Set_Shift ; if not equal, get next shift dec ecx jns Cmp_Loop jmp Match ; fall through on match
Set Shift
mov eax, shift_table[eax*4] cmp ebx, eax jne Calc_Suffix_Shift ; run SUFFIX calculations lea esi, [esi+ecx+1] ; add BAD CHAR shift jmp Test_Length

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Match
sub esi, lpSource ; sub source from ESI mov eax, esi ; put length in eax jmp Cleanup
No Match
mov eax, -1
Cleanup
pop edi pop esi pop ebx

ret

BMBinSearch endp

; #########################################################################

end

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ @ @ The Horspool style variation using the BAD CHARACTER shift @ @ @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

; #########################################################################

.486
.model flat, stdcall ; 32 bit memory model option casemap :none ; case sensitive

.code

; #########################################################################

BMHBinsearch proc startpos:DWORD,

                  lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD

LOCAL cval:DWORD
LOCAL shift_table[256]:DWORD

push ebx
push esi
push edi

mov ebx, subLngth

cmp ebx, 1
jg @F
mov eax, -2 ; string too short, must be > 1 jmp BMHout
@@:

mov esi, lpSource
add esi, srcLngth
sub esi, ebx
mov edx, esi ; set Exit Length

; ---------------------------------------- ; load shift table with value in subLngth ; ---------------------------------------- mov ecx, 256
mov eax, ebx
lea edi, shift_table
rep stosd

; ---------------------------------------------- ; load decending count values into shift table ; ----------------------------------------------

    mov ecx, ebx                ; SubString length in ECX
dec ecx                     ; correct for zero based index
mov esi, lpSubStr           ; address of SubString in ESI

lea edi, shift_table

xor eax, eax

Write Chars
mov al, [esi] ; get the character inc esi mov [edi+eax*4], ecx ; write shift for each character dec ecx ; to ascii location in table jnz Write_Chars

; -----------------------------
; set up for main compare loop
; -----------------------------
mov ecx, ebx
dec ecx
mov cval, ecx

mov esi, lpSource
mov edi, lpSubStr
add esi, startpos ; add starting position

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Main Loop
sub eax, eax ; zero EAX before partial write mov al, [esi+ecx] cmp al, [edi+ecx] ; cmp characters in ESI / EDI jne Get_Shift ; if not equal, get next shift dec ecx jns Main_Loop

jmp Matchx

Get Shift
inc esi ; inc esi for minimum shift cmp ebx, shift_table[eax*4] ; cmp subLngth to char shift jne Exit_Test add esi, ecx ; add bad char shift
Exit Test
mov ecx, cval ; reset counter in compare loop cmp esi, edx ; test for exit condition jl Main_Loop

jmp MisMatch

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Matchx
sub esi, lpSource ; sub source from ESI mov eax, esi ; put length in eax jmp BMHout
MisMatch
mov eax, -1
BMHout
pop edi pop esi pop ebx

ret

BMHBinsearch endp

; #########################################################################

end

        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@                                                        @
@   The simplified version using the GOOD SUFFIX shift   @
@                                                        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

; #########################################################################

.486
.model flat, stdcall ; 32 bit memory model option casemap :none ; case sensitive

.code

; #########################################################################

SBMBinSearch proc startpos:DWORD,

                  lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD

LOCAL shift_table[256]:DWORD

push ebx
push esi
push edi

mov edx, subLngth

cmp edx, 1
jg @F
mov eax, -2 ; string too short, must be > 1 jmp Cleanup
@@:

mov esi, lpSource
add esi, srcLngth
sub esi, edx
mov ebx, esi ; set Exit Length

; ---------------------------------------- ; load shift table with value in subLngth ; ---------------------------------------- mov ecx, 256
mov eax, edx
lea edi, shift_table
rep stosd

; ---------------------------------------------- ; load decending count values into shift table ; ----------------------------------------------

    mov ecx, edx                ; SubString length in ECX
dec ecx                     ; correct for zero based index
mov esi, lpSubStr           ; address of SubString in ESI

lea edi, shift_table

xor eax, eax

Write Shift Chars
mov al, [esi] ; get the character inc esi mov [edi+eax*4], ecx ; write shift for each character dec ecx ; to ascii location in table jnz Write_Shift_Chars

; -----------------------------
; set up for main compare loop
; -----------------------------

mov esi, lpSource
mov edi, lpSubStr
dec edx

    xor eax, eax                ; zero EAX
add esi, startpos           ; add starting position

jmp Cmp_Loop

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Calc Suffix Shift
add ecx, shift_table[eax*4] ; add shift value to loop counter sub ecx, edx ; sub pattern length jns Pre_Compare mov ecx, 1 ; minimum shift is 1
Pre Compare
add esi, ecx ; add suffix shift mov ecx, edx ; reset counter for compare loop
Exit Text
cmp ebx, esi ; test exit condition jl No_Match

xor eax, eax ; clear EAX for following partial writes mov al, [esi+ecx] cmp al, [edi+ecx] ; cmp characters in ESI / EDI je @F add esi, shift_table[eax*4] jmp Exit_Text @@: dec ecx

xor eax, eax ; clear EAX for following partial writes

Cmp Loop
mov al, [esi+ecx] cmp al, [edi+ecx] ; cmp characters in ESI / EDI jne Calc_Suffix_Shift ; if not equal, get next shift dec ecx jns Cmp_Loop jmp Match ; match on fall through

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Match
sub esi, lpSource ; sub source from ESI mov eax, esi ; put length in eax jmp Cleanup
No Match
mov eax, -1
Cleanup
pop edi pop esi pop ebx

ret

SBMBinSearch endp

; #########################################################################

end

********************************** END *************************************