Skip to content

core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 17, 2014

Conversation

nham
Copy link
Contributor

@nham nham commented Sep 2, 2014

This is primarily an attempt to fix #16878, which was a false positive in the case of "1234567ah012345678901ah".contains("hah") (and other examples).

Some background: Two-Way starts off by factorizing the needle into two halves (u, v) based on some criteria. it then checks whether u is a suffix of v[:period(v)]. If so, it uses what is called "Algorithm CP1" in Crochemore and Rytter's book Text Algorithms, and "Algorithm CP2" otherwise. CP2 is optimized for needles with large periods.

"hah" happens to get factorized into ("h", "ah"), which means it runs CP1. As far as I understand, this bug can only occur in CP1, because only CP1 uses an extra variable to memorize prefixes (this is the memory field in the TwoWaySearcher struct). To quote from Crochemore and Rytter, p. 317:

The variable s is used to memorize a prefix of the pattern that matches the text at the current position, given the variable pos. Variable s is updated at every mismatch. It can be set to a non-zero value only in the case of a mismatch occurring during the scan of the left part of the pattern.

self.memory (which plays the role of s in our implementation) actually gets reset to 0 when a mismatch occurs during right scan, but it wasn't being reset to 0 during a jump that occurred via byteset mismatch. The byteset thing is not a part of the original Two-Way algorithm, and I actually don't understand how it works. But it checks something to determine if we can skip by the entire length of the needle.

I believe the bug appears when we have a mismatch during left scan that sets self.memory to something non-zero, followed by a bunch of byteset skips, followed by a match on the right scan and only a partial match on the left scan. The fix should be to reset self.memory to 0 when doing a byteset skip, because no prefix can match after such a skip (such skips are longer than the skips for mismatches during right scans, for which it gets reset).

Further evidence that this is the correct fix is that if you remove byteset entirely, the false positives go away.

I've also added extensive comments, maybe too many. Let me know if I've gone into too much detail.

@gereeter
Copy link
Contributor

gereeter commented Sep 2, 2014

Blech, it looks like I really messed up in my implementation, with this and the whole banana bug. Thanks for tracking this down - I think your guess is correct and that resetting the memory to 0 is the correct behavior, as we are moving to a place we know nothing about, so should be in a state identical to when we started.

You are correct that the byteset was not in the original algorithm. The idea behind it (which my benchmarks confirmed) was that especially when searching for small needles, there are many characters in the haystack that simply are not contained in the needle. For example, when searching code for some alphanumeric identifier, we will often run into spaces and brackets.

To utilize this, we build a very simple bitset of the characters in the needle. Then, when searching, if we find a character that is not in the needle (as confirmed by this bitset), we can jump forward by the whole length of the needle.

@gereeter
Copy link
Contributor

gereeter commented Sep 2, 2014

Also, note that the glibc implementation uses a more complicated bad byte shift table for the same purpose.

alexcrichton added a commit to alexcrichton/rust that referenced this pull request Sep 17, 2014
@bors bors merged commit d1bcd77 into rust-lang:master Sep 17, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

find_str returns false positives
4 participants