Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upper/lower case issues with non-english texts #39

Open
humanzz opened this issue Aug 22, 2016 · 4 comments
Open

Upper/lower case issues with non-english texts #39

humanzz opened this issue Aug 22, 2016 · 4 comments

Comments

@humanzz
Copy link

humanzz commented Aug 22, 2016

Hello,
I was using the ahocorasick library to perform some text processing on English Wikipedia dump. As it contains several non-english words, it lead me to discovering an issue in how the library handles cases.

I wrote a test case to explain it. Here it is

@Test
public void caseInsensitiveTrieWithSomeUnicodeCharactersCreatesEmitsWithWrongStart(){
    // when this is lower cased, it becomes a string of length 2!
    String upperLengthOne = "İ";
    String normalI = "I";
    Trie trie = Trie.builder()
                    .caseInsensitive()
                    //.onlyWholeWords()
                    .addKeyword(upperLengthOne)
                    .build();
    // because when lower cased it becomes 2 characters, the emit gets confused and creates a string that starts at -1
    // this can cause further problems if we index into the original string with this emit start and end.
    // This happens if we make the trie builder.onlyWholeWords() in which case we get exceptions
    assertEquals(-1, trie.parseText(upperLengthOne).stream().findAny().get().getStart());
}

The whole problem happens when lower-casing the keyword has a different length than the original keyword. I can't think of a quick easy fix for this issue, but I guess the least that can be done about it is that the library should throw an error when the lowercased keyword's length is != the original keywords length warning the library's user to this issue.

@humanzz
Copy link
Author

humanzz commented Aug 22, 2016

Building on the fix to unicode issues in #8, I fixed the problem in the pull request above.

@matanox
Copy link

matanox commented Dec 2, 2017

Was a solution finally merged?

@ghost
Copy link

ghost commented Dec 3, 2017

I don't believe the merge can happen until some of the conflicts have been resolved.

@matanox
Copy link

matanox commented Dec 3, 2017

Sounds like a mildly horrific bug to keep in circulation :)
Thanks for spotting it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants