core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936

nham · 2014-09-02T06:23:27Z

This is primarily an attempt to fix #16878, which was a false positive in the case of "1234567ah012345678901ah".contains("hah") (and other examples).

Some background: Two-Way starts off by factorizing the needle into two halves (u, v) based on some criteria. it then checks whether u is a suffix of v[:period(v)]. If so, it uses what is called "Algorithm CP1" in Crochemore and Rytter's book Text Algorithms, and "Algorithm CP2" otherwise. CP2 is optimized for needles with large periods.

"hah" happens to get factorized into ("h", "ah"), which means it runs CP1. As far as I understand, this bug can only occur in CP1, because only CP1 uses an extra variable to memorize prefixes (this is the memory field in the TwoWaySearcher struct). To quote from Crochemore and Rytter, p. 317:

The variable s is used to memorize a prefix of the pattern that matches the text at the current position, given the variable pos. Variable s is updated at every mismatch. It can be set to a non-zero value only in the case of a mismatch occurring during the scan of the left part of the pattern.

self.memory (which plays the role of s in our implementation) actually gets reset to 0 when a mismatch occurs during right scan, but it wasn't being reset to 0 during a jump that occurred via byteset mismatch. The byteset thing is not a part of the original Two-Way algorithm, and I actually don't understand how it works. But it checks something to determine if we can skip by the entire length of the needle.

I believe the bug appears when we have a mismatch during left scan that sets self.memory to something non-zero, followed by a bunch of byteset skips, followed by a match on the right scan and only a partial match on the left scan. The fix should be to reset self.memory to 0 when doing a byteset skip, because no prefix can match after such a skip (such skips are longer than the skips for mismatches during right scans, for which it gets reset).

Further evidence that this is the correct fix is that if you remove byteset entirely, the false positives go away.

I've also added extensive comments, maybe too many. Let me know if I've gone into too much detail.

…teset Closes rust-lang#16878.

gereeter · 2014-09-02T13:10:11Z

Blech, it looks like I really messed up in my implementation, with this and the whole banana bug. Thanks for tracking this down - I think your guess is correct and that resetting the memory to 0 is the correct behavior, as we are moving to a place we know nothing about, so should be in a state identical to when we started.

You are correct that the byteset was not in the original algorithm. The idea behind it (which my benchmarks confirmed) was that especially when searching for small needles, there are many characters in the haystack that simply are not contained in the needle. For example, when searching code for some alphanumeric identifier, we will often run into spaces and brackets.

To utilize this, we build a very simple bitset of the characters in the needle. Then, when searching, if we find a character that is not in the needle (as confirmed by this bitset), we can jump forward by the whole length of the needle.

gereeter · 2014-09-02T13:13:07Z

Also, note that the glibc implementation uses a more complicated bad byte shift table for the same purpose.

nham added 2 commits September 2, 2014 01:53

core: Make TwoWaySearcher reset its prefix memory when shifting by by…

e9db8ad

…teset Closes rust-lang#16878.

Add many comments to TwoWaySearcher.

d1bcd77

alexcrichton added a commit to alexcrichton/rust that referenced this pull request Sep 17, 2014

rollup merge of rust-lang#16936 : nham/two_way_makeover

8875584

bors merged commit d1bcd77 into rust-lang:master Sep 17, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936

core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936

Uh oh!

nham commented Sep 2, 2014

Uh oh!

gereeter commented Sep 2, 2014

Uh oh!

gereeter commented Sep 2, 2014

Uh oh!

Uh oh!

core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936

core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936

Uh oh!

Conversation

nham commented Sep 2, 2014

Uh oh!

gereeter commented Sep 2, 2014

Uh oh!

gereeter commented Sep 2, 2014

Uh oh!

Uh oh!