core: Make TwoWaySearcher reset its prefix memory when shifting by byteset #16936
This is primarily an attempt to fix #16878, which was a false positive in the case of `"1234567ah012345678901ah".contains("hah")` (and other examples).

Some background: Two-Way starts off by factorizing the needle into two halves `(u, v)` based on some criteria. It then checks whether `u` is a suffix of `v[:period(v)]`. If so, it uses what is called "Algorithm CP1" in Crochemore and Rytter's book *Text Algorithms*, and "Algorithm CP2" otherwise. CP2 is optimized for needles with large periods.

`"hah"` happens to get factorized into `("h", "ah")`, which means it runs CP1. As far as I understand, this bug can only occur in CP1, because only CP1 uses an extra variable to memorize prefixes (this is the `memory` field in the `TwoWaySearcher` struct). To quote from Crochemore and Rytter, p. 317:

`self.memory` (which plays the role of `s` in our implementation) actually gets reset to 0 when a mismatch occurs during right scan, but it wasn't being reset to 0 during a jump that occurred via a `byteset` mismatch. The `byteset` thing is not a part of the original Two-Way algorithm, and I actually don't understand how it works. But it checks something to determine if we can skip by the entire length of the needle.

I believe the bug appears when we have a mismatch during left scan that sets `self.memory` to something non-zero, followed by a bunch of byteset skips, followed by a match on the right scan and only a partial match on the left scan. The fix should be to reset `self.memory` to 0 when doing a byteset skip, because no prefix can match after such a skip (such skips are longer than the skips for mismatches during right scans, for which it gets reset).

Further evidence that this is the correct fix is that if you remove `byteset` entirely, the false positives go away.

I've also added extensive comments, maybe too many. Let me know if I've gone into too much detail.
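
For reviewers who want to cross-check the failing case without the Two-Way machinery, here is a minimal sketch. It is not the code in this PR: it is a naive window-by-window search with a byteset-style skip layered on top, and it has no `memory` variable at all. The 64-bit mask keyed on the low six bits of each byte is an assumption about how such a prefilter can work, and the names `byteset`, `naive_contains`, and `contains_with_byteset_skip` are hypothetical.

```rust
/// Hypothetical helper: an approximate set of the bytes in `needle`,
/// packed into 64 bits by keeping only the low six bits of each byte.
/// Distinct bytes may share a bit, so membership can give false
/// positives but never false negatives.
fn byteset(needle: &[u8]) -> u64 {
    needle
        .iter()
        .fold(0u64, |set, &b| set | (1u64 << (b & 0x3f)))
}

/// Reference implementation: compare the needle against every window.
fn naive_contains(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true; // `windows(0)` would panic, so handle this first
    }
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Naive search plus a byteset skip. If the last byte of the current
/// window is not in the set, no window containing that byte can match,
/// so the search jumps ahead by the full needle length.
fn contains_with_byteset_skip(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true;
    }
    let set = byteset(needle);
    let mut pos = 0;
    while pos + needle.len() <= haystack.len() {
        let last = haystack[pos + needle.len() - 1];
        if (set & (1u64 << (last & 0x3f))) == 0 {
            // Nothing is remembered across this jump: matching restarts
            // from scratch at the new position.
            pos += needle.len();
            continue;
        }
        if &haystack[pos..pos + needle.len()] == needle {
            return true;
        }
        pos += 1;
    }
    false
}

fn main() {
    let cases = [
        ("1234567ah012345678901ah", "hah"), // the example from #16878
        ("ahah", "hah"),
        ("abc", ""),
    ];
    for &(h, n) in cases.iter() {
        let expected = naive_contains(h.as_bytes(), n.as_bytes());
        assert_eq!(contains_with_byteset_skip(h.as_bytes(), n.as_bytes()), expected);
        assert_eq!(h.contains(n), expected);
        println!("{:?}.contains({:?}) = {}", h, n, expected);
    }
}
```

Because the sketch keeps no state between windows, a skip cannot leave a stale prefix length behind. That is the property the fix restores in CP1 by resetting `self.memory` to 0 on a byteset skip.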