Normalizing fancy characters #72

GokulNC · 2021-10-30T08:10:02Z

Thanks for making the library. It has been really helpful for cleaning social media text.

Here are some cases where the transliteration/conversion was not correct (Version 1.3.2):

>>> from unidecode import unidecode
>>> unidecode("ᕼᗩᑭᑭIᗴᗴ")
'hpokikiIgaga'
>>> unidecode("🇦🇷🇮")
''
>>> unidecode("ωεłł")
'oell'
>>> unidecode("RᗅIPႮ")
'RghoIPP'
>>> unidecode("ғʀᴇᴇ")
"g'REE"

I will update this issue with more examples as I come across them. Thanks!

Edit:

It looks like most of these issues occur because the characters belong to other scripts; they are not fancy variants of Latin letters.
So, is there some way to do appearance-based conversion rather than approximate-phonetic conversion?
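
To illustrate the difference, here is a rough sketch that uses only the standard-library `unicodedata` module. It is not part of unidecode: the `script_of` and `by_appearance` helpers and the `LOOKALIKES` table are hypothetical, and the table is hand-written for just two of the examples above. `script_of` takes the first word of each character's Unicode name as a crude stand-in for its script, which shows why a phonetic transliteration produces the outputs listed earlier.

```python
import unicodedata

# Hypothetical, hand-written lookalike table (an assumption for illustration,
# not data shipped with unidecode). It covers only two of the examples above.
LOOKALIKES = {
    "ᕼ": "H",
    "ᗩ": "A",
    "ᑭ": "P",
    "ᗴ": "E",
    "ω": "w",
    "ε": "e",
    "ł": "l",
}


def script_of(ch: str) -> str:
    """Return the first word of the character's Unicode name.

    This is only a rough proxy for the script (e.g. "GREEK", "CYRILLIC").
    Characters without a name yield "UNKNOWN".
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def by_appearance(text: str) -> str:
    """Replace characters found in LOOKALIKES; pass everything else through."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for ch in "ᕼᗩᑭᑭIᗴᗴ":
        print(ch, script_of(ch))
    print(by_appearance("ᕼᗩᑭᑭIᗴᗴ"))  # 'HAPPIEE' with the table above
```

A real solution would need a much larger visual-similarity table than this one; the sketch only shows the shape of the appearance-based lookup being asked about.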
