U+1F6xx block: emoticons and dingbats #18

chungy · 2017-09-28T01:36:47Z

The emoticons might end up being especially contentious, with styles and opinions both varying wildly, and I've only tried to replicate ones that seem to be pretty simple.

For the emoji that aren't really representable in ASCII, I'm not sure what should be done. Maybe just the colon-sandwiched codes as seen in some messengers and GitHub? E.g., :cheese_wedge: or :wolf:

avian2 · 2017-09-29T07:58:59Z

Yes, emoticons are problematic. I would try to be consistent with the textual description provided by Unicode. For example, have all emoticons whose description says "open mouth" use "D", and so on. In your patch, for example, U+1F603 "Smiling face with open mouth" is :-), but U+1F604 "Smiling face with open mouth and smiling eyes" is :-D. I would map both to :-D.
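A minimal sketch of that description-driven rule, assuming Python's standard `unicodedata` module as the source of character names. The function name `emoticon_for` and the single rule it applies are hypothetical illustrations, not part of Unidecode or of the patch under review.

```python
import unicodedata


def emoticon_for(char):
    """Pick an ASCII emoticon from the character's Unicode name.

    Hypothetical rule: any name containing "SMILING FACE WITH OPEN MOUTH"
    maps to ":-D". Characters that match no rule return None.
    """
    name = unicodedata.name(char, "")  # "" if the character has no name
    if "SMILING FACE WITH OPEN MOUTH" in name:
        return ":-D"
    return None


# U+1F603 and U+1F604 share the "open mouth" wording, so both get ":-D".
print(emoticon_for("\U0001F603"))
print(emoticon_for("\U0001F604"))
```

Driving the choice from the name keeps characters with similar descriptions consistent, at the cost of needing one rule per description pattern.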

Regarding other emojis, I agree colon codes seem to be the best solution. Problems I see are:

- They are not completely standardized (different systems have slightly different names as far as I know).
- They are a form of machine-readable text markup. Unidecode I think shouldn't output markup.

But honestly I don't see any other solution. They seem to be the de-facto way people put emojis into plain ASCII.

chungy · 2017-10-14T01:13:33Z

Sorry it's taken so long for me to follow up on your comments, but I appreciate them a lot. Standardizing the emoticons by matching each representation to its Unicode description seems to be a good idea.

We can base the more graphic-oriented emojis similarly, making up colon syntaxes based on Unicode name rather than any specific application. My only problem is that the name of the character can be rather verbose and long.

As an example, 🖖 is represented in Keybase with :spock-hand:, but the Unicode name is “RAISED HAND WITH PART BETWEEN MIDDLE AND RING FINGERS”. I think that :raised-hand-with-part-between-middle-and-ring-fingers: is not quite desirable, but I don't know the best course here.
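To make the verbosity concern concrete, here is a minimal sketch that builds a colon code directly from the Unicode name, assuming Python's standard `unicodedata` module. The helper name `colon_code_from_name` is hypothetical and is not an existing Unidecode function.

```python
import unicodedata


def colon_code_from_name(char):
    """Build a colon-delimited code from a character's Unicode name.

    Raises ValueError if the character has no name in the Unicode database.
    """
    name = unicodedata.name(char)
    return ":" + name.lower().replace(" ", "-") + ":"


# U+1F596 is the "spock hand" character discussed above.
print(colon_code_from_name("\U0001F596"))
# → :raised-hand-with-part-between-middle-and-ring-fingers:
```

The result is unambiguous and application-independent, but far longer than an application-specific code such as :spock-hand:.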

avian2 · 2017-10-17T08:51:34Z

I haven't seen actual Unicode names used in this way. I agree that using them for colon codes wouldn't be the best. The de-facto standard seems to be "short codes", like listed here:

https://www.webpagefx.com/tools/emoji-cheat-sheet

I don't think these are condoned by Unicode and I don't know where they originally came from. Some software library or WordPress perhaps? I seem to remember seeing a page that listed differences in these codes between different services, but I can't find it at the moment.

Stealthii · 2017-11-08T12:03:58Z

Unidecode already translates based on romanisation and pronunciation of some foreign characters, and I believe the intention of this library is to convey representation - after all, our output is ambiguous at best, as the actual truth lies within the original unicode literal.

For this reason, I think the best effort approach is to convey the clearest meaning, and I would suggest the cheat-sheet that @avian2 linked, as the shortcodes described there are a de-facto implementation used by most chat clients.

There is nothing to say we can't change this in future, right?

mvasilkov · 2018-06-18T10:17:44Z

May I suggest not using smiley faces made of punctuation? IMO shortcodes like :open_mouth: are much better than :-O.

- Punctuation smileys are very culture-specific. E.g. ¯\_(ツ)_/¯
- With words, there is less ambiguity. E.g. :-Down << Is this "😄own" or "down"?
- One of the prominent use cases for this library is generating web slugs (permanent URLs). So the following may happen:
  - original title: "You are mine 😄"
  - unidecode: "You are mine :-D"
  - URL: /you-are-mine-d
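The slug outcome above can be reproduced with a minimal sketch. The `naive_slug` helper is a hypothetical stand-in for a typical slug generator, not the behavior of any particular library, and the `:smile:` input is only an illustrative shortcode.

```python
import re


def naive_slug(text):
    """Lowercase the text and collapse each run of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


# The punctuation smiley loses its meaning: only the "d" survives.
print(naive_slug("You are mine :-D"))      # → you-are-mine-d

# A word-based shortcode keeps a readable token in the slug.
print(naive_slug("You are mine :smile:"))  # → you-are-mine-smile
```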

avian2 · 2018-06-18T15:21:46Z

On the other hand, :open_mouth: is language-specific. In my opinion punctuation smileys (where applicable) would actually be more universal in that respect.

The problem of a smiley possibly merging with an adjacent word can be solved by surrounding them with leading and/or trailing space. Unidecode already does that for some symbols.
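A minimal sketch of the padding idea, assuming a hypothetical one-entry table `EMOTICONS` and a hypothetical helper `replace_padded`. This does not show how Unidecode stores or applies its tables; it only illustrates that surrounding the replacement with spaces prevents it from merging with a neighboring word.

```python
# Hypothetical mapping used only for this illustration.
EMOTICONS = {"\U0001F604": ":-D"}


def replace_padded(text):
    """Replace known emoji with an emoticon surrounded by single spaces."""
    out = []
    for ch in text:
        if ch in EMOTICONS:
            out.append(" " + EMOTICONS[ch] + " ")
        else:
            out.append(ch)
    return "".join(out)


# Without padding this would read ":-Down"; with padding the word stays intact.
print(repr(replace_padded("\U0001F604own")))  # → ' :-D own'
```

One trade-off: when the emoji is already next to a space, padding produces a doubled space that a caller may want to collapse.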

mvasilkov · 2018-06-18T17:57:05Z

Ah, I assumed it was targeting English the whole time, since it mentions the US keyboard layout in the readme, and also this: https://github.com/avian2/unidecode/blob/master/unidecode/x033.py

But fair enough I guess.

avian2 · 2018-06-19T19:07:44Z

I wouldn't like to target English any more than the fact that Unidecode transliterates into a character set with a US origin. US keyboard layout is mentioned because that is the most common layout used to enter ASCII text. I used it as an illustration of what problem Unidecode tries to solve: imagine a person trying to enter non-English words into a computer that only accepts ASCII through an American keyboard.

I am not familiar with the U+33xx Unicode page you mention. Codepoint descriptions suggest these represent those specific English words (I'm guessing for use in Japanese text?)

To be honest, I see no perfect solution for emojis, and Unidecode is about compromise. I think English short-codes as discussed above would be a good start.

U+1F6xx block: emoticons and dingbats

f7e5647
