Four approaches are presented here, three by Steve Hutchesson, who also wrote a
very good introductory text explaining the foundation of the Boyer Moore search
algorithm and its variations, and one by buliaNaza who aims at writing the
fastest binary string search algorithm for PPlain and PMMX processors.
Solution
Four approaches are presented here, three by Steve Hutchesson, who also wrote a
very good introductory text explaining the foundation of the Boyer Moore search
algorithm and its variations, and one by buliaNaza who aims at writing the
fastest binary string search algorithm for PPlain and PMMX processors.
Three Boyer Moore Exact Pattern Matching Algorithms
by Steve Hutchesson
Three Boyer Moore Exact Pattern Matching Algorithms
Steve Hutchesson
Sydney
Australia
August 2001
{
This e-mail address is being protected from spam bots, you need JavaScript enabled to view it
}
In 1977 Robert Boyer and L. Moore designed an exact pattern matching
algorithm that was different from any of the contemporary designs of the
time. It had a fundamentally different logic that compared the pattern
being searched for to the current location in the source in reverse order.
The logic was based on obtaining more information from performing the
comparison in reverse than the standard methods of forward comparison. If a
character that caused the mismatch was not among the characters that were
in the pattern being matched, there was no point in matching any further
characters so the pattern could be shifted right by the number of
characters needed to go past it.
This shift has usually been called the BAD CHARACTER shift.
|
source : bad character shift
pattern : shift
|
Character "t" mismatches with character "c" in the source. "c" is not in
the pattern being searched for and there is no point in searching further
back as no match is possible at the current location so the pattern is
shifted the number of places right so that the pattern is completely past
the mismatching character.
|
source : bad character shift
pattern : shift
|
Character "t" again mismatches with character "c" in the source so the
pattern is again shifted completely past the mismatching character.
|
source : bad character shift
pattern : shift
|
The next mismatch is different to the previous ones, it is with a character
that is within the pattern being searched for and this requires a different
type of shift. When a character is within the pattern, it allows the
capacity to start matching the pattern to the source. This shift is usually
called the GOOD SUFFIX shift but it is sometimes called the MATCHING SHIFT.
The fundamental Boyer Moore design uses a clever method of determining if
the character being compared is within the pattern being searched for or
not. It constructs a table of 256 members which is initially filled with
the length of the pattern being searched for in the source. It then
overwrites the position of each character in the pattern into the table at
the correct position for the character's ascii value.
This means that a character being compared can be tested in one memory read
to determine if it is within the pattern or not, if the shift in the table
is the same length as the pattern, the character is not in the pattern, if
it is less, it is a character that is in the pattern.
This will produce a set of shifts for the character in the pattern that
descend in their value.
pattern : shift
4321 <- GOOD SUFFIX shift
12345 <- BAD CHARACTER shift
The method of calculating the BAD CHARACTER shift is based on the ascending
count from the beginning of the pattern. If it is the first character being
compared, the shift is the length of the pattern, for each comparison made,
the shift decrements by one.
Apply the GOOD SUFFIX shift from the table and the pattern is shifted
across so that the character "s" lines up with the "s" in the source and
the pattern has been matched.
*
source : bad character shift
pattern : shift
*
This example works OK because the mismatch occurs on the first comparison
but in patterns that have repeat sequences of characters, this matching by
itself will often fail to produce a match.
pattern : foooooo
711111 <- GOOD SUFFIX shift
1234567 <- BAD CHARACTER shift
The sequence of "1" in the GOOD SUFFIX shift is caused by the overwriting
of the location for the character "o" in the table for each of its
occurrences. The normal method is to subtract the BAD CHARACTER shift from
the GOOD SUFFIX shift if the mismatch is not the first at the current
location in the source. This can produce a value less than 1 so a minimum
shift of 1 is applied if this happens.
Coding Considerations
Much of the available technical data on exact pattern matching is written
in ANSI C and it tends to carry the set of assumptions related to the
capacity of that language. The "holy grail" of exact pattern matching is to
perform as few comparisons as necessary to obtain the match if it exists.
This is usually called "sublinearity" and it means comparing less
characters that a traditional forward BYTE scanner.
The problem with this approach is that if the overhead to produce the
"sublinearity" is too large, the algorithm is slower than a BYTE scanner so
considerations of theoretical design must be tempered with what is possible
with good coding practice to deliver the desired speed.
The BAD CHARACTER shift has often been coded in high level languages as
another table but it is a very inefficient way to code the shift as the
loop counter in the main comparison loop holds the same value and it can be
accessed a lot faster than a member of a table in memory.
The three version presented below use an Intel specific optimisation
related to preventing a register stall by reading and comparing a byte in
AL and subsequently using the EAX register in the table location
calculation. XOR EAX, EAX or SUB EAX, EAX both zero the register and the
stall does not occur. This makes the code slightly slower on AMD hardware
but not by very much.
There is an additional heuristic in the original Boyer Moore algorithm that
has not been implemented, when a BAD CHARACTER shift has been determined,
the heuristic requires that the larger of the two shifts should be applied.
In practice the two extra instructions to perform this comparison reduce
the speed of the algorithm by about 5%.
Where a GOOD SUFFIX shift is required that is the first mismatch at the
current location, the calculation that subtracts the BAD CHARACTER shift is
not required so a seperate loop has been included to save this extra two
instructions. The speed increase is about 5% for doing so.
Processor Variation
Testing shows that there is measurable differences between later Intel
processors and later AMD processors. The AMD has a shorter pipeline and a
lower penalty for register stalls where the Intel processors have better
branch prediction and a lower penalty for mispredicted jumps. The GOOD
SUFFIX shift favours the AMD processors where the BAD CHARACTER shift works
better on the Intel processors.
Three variations are implemented that utilise the different shifts, the
original BM algorithm uses both shifts, a variation that is similar to a
Horspool variation uses only the BAD CHARACTER shift and another variation
only uses the GOOD SUFFIX shift.
Algorithm Variations
The original BM algorithm has a slightly higher overhead than the two
variations but it generally produces a larger shift and this has the effect
that it is more consistent across both processor types with different
patterns and different pattern lengths. This is because it it more
dependent on logic that fast loop code.
The Horspool variation perfoms well on Intel hardware and is well suited for
plain text search in things like text editors and word processors but it is
sensitive to patterns that have a high frequency of characters in the source
being searched. Its advantage is small loop code in the searching phase. In
this implementation, it does the comparison in reverse order as this method
produces the BAD CHARACTER shift in the most efficient manner.
The second variation uses only the GOOD SUFFIX shift and generally performs
well on older Intel hardware and later AMD machines. It has the advantage of
fast loop code but by only using one of the available shifts, its average shift
length is shorter than the original algorithm. It uses the same bypass for the
first mismatch that the original BM algo has.
Limitations
The pattern length threshold for improving on a forward byte scanner appears to
be about 6 characters. Below this a BYTE scanner is faster. A BM type algorithm
has about a 300 character penalty in the time it takes to construct the table
and this must be kept in mind if the task requires recursively searching short
sources for short patterns.
A slightly more subtle consideration is what is called "mismatch recovery".
Boyer Moore algorithms have normally been sensitive to the frequency of end
characters in the pattern and this is easy to demonstrate when searching plain
text when the pattern has a trailing blank space in it. EXAMPLE : "pattern "
The solution is to code the comparison loop with a very short instruction path
and while this does not particularly increase the absolute forward scanning
speed of the algorithm type, it does improve its recovery from repeated
mismatches.
The three algorithms presented below have very good mismatch recovery which is
related to their very short comparison loops instruction paths.
The three algorithms have been tested on Intel Celeron, PII and PIII machines
and AMD K6-2, Duron and Athlon machines. They have been optimised to run on
both types without specifically targetting one particular model. Slight speed
increases can be obtained by coding specifically for one particular model
but usually at the expense of most other processors.
The parameters for the 3 procedures.
startpos zero based offset to start searching in the source
lpSource the address of the source to search
srcLngth the length of the source
lpSubStr the address of the pattern to search for
subLngth the length of the pattern
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ @
@ The basic Boyer Moore algorithm @
@ @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; #########################################################################
.486
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.code
; #########################################################################
BMBinSearch proc startpos:DWORD,
lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD
LOCAL cval :DWORD
LOCAL shift_table[256]:DWORD
push ebx
push esi
push edi
mov ebx, subLngth
cmp ebx, 1
jg @F
mov eax, -2 ; string too short, must be > 1
jmp Cleanup
@@:
mov esi, lpSource
add esi, srcLngth
sub esi, ebx
mov edx, esi ; set Exit Length
; ----------------------------------------
; load shift table with value in subLngth
; ----------------------------------------
mov ecx, 256
mov eax, ebx
lea edi, shift_table
rep stosd
; ----------------------------------------------
; load decending count values into shift table
; ----------------------------------------------
mov ecx, ebx ; SubString length in ECX
dec ecx ; correct for zero based index
mov esi, lpSubStr ; address of SubString in ESI
lea edi, shift_table
xor eax, eax
- Write Shift Chars
- mov al, [esi] ; get the character
inc esi
mov [edi+eax*4], ecx ; write shift for each character
dec ecx ; to ascii location in table
jnz Write_Shift_Chars
; -----------------------------
; set up for main compare loop
; -----------------------------
mov ecx, ebx
dec ecx
mov cval, ecx
mov esi, lpSource
mov edi, lpSubStr
add esi, startpos ; add starting position
jmp Pre_Loop
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- Calc Suffix Shift
- add eax, ecx
sub eax, cval ; sub loop count
jns Add_Suffix_Shift
mov eax, 1 ; minimum shift is 1
- Add Suffix Shift
- add esi, eax ; add SUFFIX shift
mov ecx, cval ; reset counter in compare loop
- Test Length
- cmp edx, esi ; test exit condition
jl No_Match
- Pre Loop
- xor eax, eax ; zero EAX for following partial writes
mov al, [esi+ecx]
cmp al, [edi+ecx] ; cmp characters in ESI / EDI
je @F
mov eax, shift_table[eax*4]
cmp ebx, eax
jne Add_Suffix_Shift ; bypass SUFFIX calculations
lea esi, [esi+ecx+1] ; add BAD CHAR shift
jmp Test_Length
@@:
dec ecx
xor eax, eax ; zero EAX for following partial writes
- Cmp Loop
- mov al, [esi+ecx]
cmp al, [edi+ecx] ; cmp characters in ESI / EDI
jne Set_Shift ; if not equal, get next shift
dec ecx
jns Cmp_Loop
jmp Match ; fall through on match
- Set Shift
- mov eax, shift_table[eax*4]
cmp ebx, eax
jne Calc_Suffix_Shift ; run SUFFIX calculations
lea esi, [esi+ecx+1] ; add BAD CHAR shift
jmp Test_Length
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- Match
- sub esi, lpSource ; sub source from ESI
mov eax, esi ; put length in eax
jmp Cleanup
- No Match
- mov eax, -1
- Cleanup
- pop edi
pop esi
pop ebx
ret
BMBinSearch endp
; #########################################################################
end
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ @
@ The Horspool style variation using the BAD CHARACTER shift @
@ @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; #########################################################################
.486
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.code
; #########################################################################
BMHBinsearch proc startpos:DWORD,
lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD
LOCAL cval:DWORD
LOCAL shift_table[256]:DWORD
push ebx
push esi
push edi
mov ebx, subLngth
cmp ebx, 1
jg @F
mov eax, -2 ; string too short, must be > 1
jmp BMHout
@@:
mov esi, lpSource
add esi, srcLngth
sub esi, ebx
mov edx, esi ; set Exit Length
; ----------------------------------------
; load shift table with value in subLngth
; ----------------------------------------
mov ecx, 256
mov eax, ebx
lea edi, shift_table
rep stosd
; ----------------------------------------------
; load decending count values into shift table
; ----------------------------------------------
mov ecx, ebx ; SubString length in ECX
dec ecx ; correct for zero based index
mov esi, lpSubStr ; address of SubString in ESI
lea edi, shift_table
xor eax, eax
- Write Chars
- mov al, [esi] ; get the character
inc esi
mov [edi+eax*4], ecx ; write shift for each character
dec ecx ; to ascii location in table
jnz Write_Chars
; -----------------------------
; set up for main compare loop
; -----------------------------
mov ecx, ebx
dec ecx
mov cval, ecx
mov esi, lpSource
mov edi, lpSubStr
add esi, startpos ; add starting position
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- Main Loop
- sub eax, eax ; zero EAX before partial write
mov al, [esi+ecx]
cmp al, [edi+ecx] ; cmp characters in ESI / EDI
jne Get_Shift ; if not equal, get next shift
dec ecx
jns Main_Loop
jmp Matchx
- Get Shift
- inc esi ; inc esi for minimum shift
cmp ebx, shift_table[eax*4] ; cmp subLngth to char shift
jne Exit_Test
add esi, ecx ; add bad char shift
- Exit Test
- mov ecx, cval ; reset counter in compare loop
cmp esi, edx ; test for exit condition
jl Main_Loop
jmp MisMatch
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- Matchx
- sub esi, lpSource ; sub source from ESI
mov eax, esi ; put length in eax
jmp BMHout
- MisMatch
- mov eax, -1
- BMHout
- pop edi
pop esi
pop ebx
ret
BMHBinsearch endp
; #########################################################################
end
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ @
@ The simplified version using the GOOD SUFFIX shift @
@ @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; #########################################################################
.486
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.code
; #########################################################################
SBMBinSearch proc startpos:DWORD,
lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD
LOCAL shift_table[256]:DWORD
push ebx
push esi
push edi
mov edx, subLngth
cmp edx, 1
jg @F
mov eax, -2 ; string too short, must be > 1
jmp Cleanup
@@:
mov esi, lpSource
add esi, srcLngth
sub esi, edx
mov ebx, esi ; set Exit Length
; ----------------------------------------
; load shift table with value in subLngth
; ----------------------------------------
mov ecx, 256
mov eax, edx
lea edi, shift_table
rep stosd
; ----------------------------------------------
; load decending count values into shift table
; ----------------------------------------------
mov ecx, edx ; SubString length in ECX
dec ecx ; correct for zero based index
mov esi, lpSubStr ; address of SubString in ESI
lea edi, shift_table
xor eax, eax
- Write Shift Chars
- mov al, [esi] ; get the character
inc esi
mov [edi+eax*4], ecx ; write shift for each character
dec ecx ; to ascii location in table
jnz Write_Shift_Chars
; -----------------------------
; set up for main compare loop
; -----------------------------
mov esi, lpSource
mov edi, lpSubStr
dec edx
xor eax, eax ; zero EAX
add esi, startpos ; add starting position
jmp Cmp_Loop
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- Calc Suffix Shift
- add ecx, shift_table[eax*4] ; add shift value to loop counter
sub ecx, edx ; sub pattern length
jns Pre_Compare
mov ecx, 1 ; minimum shift is 1
- Pre Compare
- add esi, ecx ; add suffix shift
mov ecx, edx ; reset counter for compare loop
- Exit Text
- cmp ebx, esi ; test exit condition
jl No_Match
xor eax, eax ; clear EAX for following partial writes
mov al, [esi+ecx]
cmp al, [edi+ecx] ; cmp characters in ESI / EDI
je @F
add esi, shift_table[eax*4]
jmp Exit_Text
@@:
dec ecx
xor eax, eax ; clear EAX for following partial writes
- Cmp Loop
- mov al, [esi+ecx]
cmp al, [edi+ecx] ; cmp characters in ESI / EDI
jne Calc_Suffix_Shift ; if not equal, get next shift
dec ecx
jns Cmp_Loop
jmp Match ; match on fall through
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- Match
- sub esi, lpSource ; sub source from ESI
mov eax, esi ; put length in eax
jmp Cleanup
- No Match
- mov eax, -1
- Cleanup
- pop edi
pop esi
pop ebx
ret
SBMBinSearch endp
; #########################################################################
end
********************************** END *************************************
|