Upper/lower case issues with non-english texts #39

humanzz · 2016-08-22T00:03:46Z

Hello,
I was using the ahocorasick library to perform some text processing on an English Wikipedia dump. As the dump contains several non-English words, it led me to discover an issue in how the library handles case.

I wrote a test case to explain it. Here it is:

@Test
public void caseInsensitiveTrieWithSomeUnicodeCharactersCreatesEmitsWithWrongStart(){
    // when this is lower cased, it becomes a string of length 2!
    String upperLengthOne = "İ";
    String normalI = "I";
    Trie trie = Trie.builder()
                    .caseInsensitive()
                    //.onlyWholeWords()
                    .addKeyword(upperLengthOne)
                    .build();
    // Because the lower-cased keyword is 2 characters long, the emit offsets are computed incorrectly and the emit starts at -1.
    // This causes further problems if we index into the original string with this emit's start and end.
    // For example, enabling onlyWholeWords() on the trie builder leads to exceptions.
    assertEquals(-1, trie.parseText(upperLengthOne).stream().findAny().get().getStart());
}

The problem occurs whenever lower-casing a keyword produces a string whose length differs from the original keyword's. I can't think of a quick, easy fix, but the least the library could do is throw an error when the lower-cased keyword's length differs from the original keyword's length, alerting the library's user to the issue.
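To make the suggested check concrete, here is a minimal sketch using only the Java standard library. The class and method names (`CaseLengthCheck`, `lowerCasePreservesLength`, `requireLengthPreservingLowerCase`, `lowerCasePerChar`) are hypothetical and not part of the ahocorasick API. The sketch assumes the length change comes from `String.toLowerCase`, which can expand one character into several, and it also shows per-character lower-casing as one possible way to keep offsets aligned.

```java
import java.util.Locale;

// Hypothetical helper class for illustration; not part of the ahocorasick library.
final class CaseLengthCheck {

    // Returns true when lower-casing leaves the UTF-16 length of the keyword unchanged.
    static boolean lowerCasePreservesLength(String keyword) {
        return keyword.toLowerCase(Locale.ROOT).length() == keyword.length();
    }

    // Guard in the spirit of the suggestion above: reject keywords whose
    // lower-cased form has a different length than the original.
    static String requireLengthPreservingLowerCase(String keyword) {
        if (!lowerCasePreservesLength(keyword)) {
            throw new IllegalArgumentException(
                "Lower-casing changes the length of keyword: " + keyword);
        }
        return keyword;
    }

    // Alternative: lower-case one UTF-16 char at a time, so the result
    // always has exactly the same length as the input.
    static String lowerCasePerChar(String text) {
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String dottedCapitalI = "\u0130"; // "İ", LATIN CAPITAL LETTER I WITH DOT ABOVE
        System.out.println(lowerCasePreservesLength("I"));
        System.out.println(lowerCasePreservesLength(dottedCapitalI));
        System.out.println(lowerCasePerChar(dottedCapitalI).length());
    }
}
```

The guard only surfaces the problem early. Per-character lower-casing avoids the length change altogether, at the cost of ignoring multi-character case mappings.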


humanzz · 2016-08-22T01:24:18Z

Building on the fix for the Unicode issues in #8, I fixed the problem in pull request #40.

matanox · 2017-12-02T01:09:35Z

Was a solution finally merged?

ghost · 2017-12-03T01:31:36Z

I don't believe the merge can happen until some of the conflicts have been resolved.

matanox · 2017-12-03T09:05:09Z

Sounds like a mildly horrific bug to keep in circulation :)
Thanks for spotting it

humanzz mentioned this issue Aug 22, 2016

Fix unicode issue which caused wrong emit start/end #40

Closed

ghost mentioned this issue Aug 23, 2020

Fix unicode issue which caused wrong emit start/end #82

Closed
