unicode conversion is buggy #48

Open
eromoe opened this issue Dec 19, 2017 · 1 comment

eromoe commented Dec 19, 2017

I think this problem is what causes "python stop working" (#47), so I am opening a separate issue.

example:

unique_text: https://pastebin.com/n2i280i8

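    import datrie

    # htmls_2_text() and input_dir are not shown here; text is the extracted corpus
    text = htmls_2_text(input_dir)
    # alphabet: every distinct character in the corpus
    unique_text = ''.join(set(text))
    trie = datrie.Trie(unique_text)
    trie['今天天气真好'] = 111
    trie['今天好'] = 222
    trie['今天'] = 444

    print(trie.items())
    # output: [('今义', 444), ('今义义傲兢于', 111), ('今义于', 222)]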

The keys come back with the wrong characters.
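To make the mismatch explicit, the inserted keys can be compared with the keys the trie hands back. The following is only a minimal sketch: `corrupted_keys` is a hypothetical helper (not part of datrie), and `reported` is the output pasted in this issue, not a fresh run.

```python
def corrupted_keys(expected, items):
    """Compare inserted keys with the (key, value) pairs a trie returned.

    Returns (missing, unexpected): keys that were inserted but did not come
    back, and keys that came back but were never inserted.
    """
    returned = dict(items)
    missing = sorted(k for k in expected if k not in returned)
    unexpected = sorted(k for k in returned if k not in expected)
    return missing, unexpected


# Keys and values inserted in the example from this issue.
expected = {'今天天气真好': 111, '今天好': 222, '今天': 444}
# Items as reported in this issue (copied from the output, not recomputed).
reported = [('今义', 444), ('今义义傲兢于', 111), ('今义于', 222)]

missing, unexpected = corrupted_keys(expected, reported)
print('missing:', missing)
print('unexpected:', unexpected)
```

With a live trie, `trie.items()` can be passed in place of `reported`. For the output pasted here, all three inserted keys are missing and three unknown keys appear in their place, while the values are unchanged.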


I tried to locate the error:

Error

    u = ''.join(set('今天天气真好' + unique_text[:400]))

gives [('今天', 444), ('今天天气I好', 111), ('今天好', 222)] (真 comes back as I).

Correct

    u = ''.join(set('今天天气真好' + unique_text[:396]))
    u = ''.join(set('今天天气真好' + unique_text[:396]+unique_text[398:400]))
    u = ''.join(set('今天天气真好' + unique_text[396:400]))

All three give the correct result.
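The manual slicing above could be automated with a small scan over prefixes of `unique_text`. This is a sketch under assumptions: `first_bad_prefix` and `datrie_round_trip_fails` are hypothetical helper names, and the check relies only on `datrie.Trie(alphabet)`, item assignment, and `trie.keys()`. A linear scan is used instead of a binary search because the slices above suggest the failure does not depend on prefix length alone.

```python
def first_bad_prefix(chars, is_bad):
    """Return the smallest n for which is_bad(chars[:n]) is true, or None."""
    for n in range(len(chars) + 1):
        if is_bad(chars[:n]):
            return n
    return None


def datrie_round_trip_fails(alphabet, keys=('今天天气真好', '今天好', '今天')):
    """True if a datrie.Trie built on this alphabet returns altered keys."""
    import datrie  # imported lazily so first_bad_prefix works without datrie

    # Make sure every character of the test keys is in the alphabet.
    trie = datrie.Trie(''.join(set(alphabet + ''.join(keys))))
    for value, key in enumerate(keys):
        trie[key] = value
    return sorted(trie.keys()) != sorted(keys)


# Usage, assuming unique_text holds the alphabet from the pastebin link:
# n = first_bad_prefix(unique_text, datrie_round_trip_fails)
# print(n, repr(unique_text[n - 1]) if n else None)
```

The scan builds one trie per prefix, which is slow for large alphabets but keeps the result easy to interpret: the character at index `n - 1` is the first one whose addition breaks the round trip.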

@Cherrymelon

I hit the same bug: Chinese Unicode characters are mapped to the wrong characters.
