U+1F6xx block: emoticons and dingbats #18

Open
wants to merge 1 commit into
base: master

chungy
Contributor

chungy commented Sep 28, 2017

The emoticons might end up being especially contentious, with styles and opinions both varying wildly, and I've only tried to replicate ones that seem to be pretty simple.

For the emoji that aren't really representable in ASCII, I'm not sure what should be done. Maybe just the colon-sandwiched codes as seen in some messengers and GitHub? E.g., :cheese_wedge: or :wolf:.

@avian2
Owner

avian2 commented Sep 29, 2017

Yes, emoticons are problematic. I would try to be consistent with the textual description provided by Unicode, e.g. have every emoticon whose description says "open mouth" use "D". In your patch, for example, U+1F603 "Smiling face with open mouth" is :-), but U+1F604 "Smiling face with open mouth and smiling eyes" is :-D. I would map both to :-D.
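A minimal sketch of that description-driven rule, using only Python's standard `unicodedata` module. The `MOUTH_RULES` table and the `ascii_emoticon` helper are hypothetical names made up for illustration, not part of Unidecode or of the patch, and the rules are deliberately crude:

```python
import unicodedata

# Hypothetical rule table: (phrase in the Unicode name, ASCII mouth character).
# Order matters: the first phrase found in the name wins.
MOUTH_RULES = [
    ("OPEN MOUTH", "D"),
    ("SMILING", ")"),
]

def ascii_emoticon(char, default="?"):
    """Derive an ASCII emoticon from the character's Unicode name."""
    name = unicodedata.name(char, "")
    for phrase, mouth in MOUTH_RULES:
        if phrase in name:
            return ":-" + mouth
    return default

# Both names contain "OPEN MOUTH", so both map to the same emoticon.
print(ascii_emoticon("\U0001F603"))  # :-D
print(ascii_emoticon("\U0001F604"))  # :-D
```

Keying on the name rather than on individual code points is what keeps similar descriptions consistent, at the cost of needing a carefully ordered rule list.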

Regarding other emojis, I agree colon codes seem to be the best solution. Problems I see are:

  • They are not completely standardized (different systems have slightly different names as far as I know).
  • They are a form of machine-readable text markup. Unidecode I think shouldn't output markup.

But honestly I don't see any other solution. They seem to be the de-facto way people put emojis into plain ASCII.

@chungy
Contributor Author

chungy commented Oct 14, 2017

Sorry it's taken so long for me to follow up on your comments, but I appreciate them a lot. Standardizing the emoticons by matching their representations to the Unicode descriptions seems to be a good idea.

We can handle the more graphic-oriented emojis similarly, making up colon codes based on the Unicode name rather than on any specific application. My only problem is that the name of the character can be rather long and verbose.

As an example, 🖖 is represented in Keybase with :spock-hand:, but the Unicode name is “RAISED HAND WITH PART BETWEEN MIDDLE AND RING FINGERS”. I think that :raised-hand-with-part-between-middle-and-ring-fingers: is not quite desirable, but I don't know the best course here.
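To illustrate how verbose name-derived codes get, here is a small sketch that builds a colon code straight from the Unicode name. `name_to_shortcode` is a hypothetical helper, not an existing Unidecode function, and the hyphen-separated style simply mirrors the example above:

```python
import unicodedata

def name_to_shortcode(char):
    """Build a colon code from the character's Unicode name (hypothetical scheme)."""
    name = unicodedata.name(char)
    return ":" + name.lower().replace(" ", "-") + ":"

print(name_to_shortcode("\U0001F596"))
# :raised-hand-with-part-between-middle-and-ring-fingers:
```

The scheme is fully mechanical and needs no hand-maintained table, but the result is far longer than an application-style code such as :spock-hand:.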

@avian2
Owner

avian2 commented Oct 17, 2017

I haven't seen actual Unicode names used in this way. I agree that using them for colon codes wouldn't be the best. The de-facto standard seems to be "short codes", like those listed here:

https://www.webpagefx.com/tools/emoji-cheat-sheet

I don't think these are condoned by Unicode and I don't know where they originally came from. Some software library or WordPress perhaps? I seem to remember seeing a page that listed differences in these codes between different services, but I can't find it at the moment.

@Stealthii

Unidecode already translates based on romanisation and pronunciation of some foreign characters, and I believe the intention of this library is to convey representation - after all, our output is ambiguous at best, as the actual truth lies within the original Unicode literal.

For this reason, I think the best-effort approach is to convey the clearest meaning, and I would suggest the cheat sheet that @avian2 linked, as the shortcodes described there are a de-facto implementation used by most chat clients.

There is nothing to say we can't change this in future, right?

@mvasilkov

May I suggest not using smiley faces made of punctuation? IMO shortcodes like :open_mouth: are much better than :-O.

  • Punctuation smileys are very culture-specific. E.g. ¯\_(ツ)_/¯
  • With words, there is less ambiguity. E.g. :-Down << Is this "😄own" or "down"?
  • One of the prominent use cases for this library is generating web slugs (permanent URLs). So the following may happen:
    • original title: "You are mine 😄"
    • unidecode: "You are mine :-D"
    • URL: /you-are-mine-d
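The slug problem in the last bullet can be reproduced with a minimal sketch. `slugify` here is a simplified stand-in written for illustration, not the slug routine of any particular framework, and the input strings are the hypothetical transliterations from the list above (`:smile:` is an assumed shortcode for the same emoji):

```python
import re

def slugify(text):
    """Lowercase, collapse runs of non-alphanumerics into '-', trim the ends."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

# A punctuation smiley loses everything except its one letter.
print(slugify("You are mine :-D"))      # you-are-mine-d

# A word-based shortcode survives as a readable token.
print(slugify("You are mine :smile:"))  # you-are-mine-smile
```

Any slug routine that strips punctuation will behave similarly, which is why the choice of representation matters for this use case.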

@avian2
Owner

avian2 commented Jun 18, 2018

On the other hand, :open_mouth: is language-specific. In my opinion punctuation smileys (where applicable) would actually be more universal in that respect.

The problem of a smiley possibly merging with an adjacent word can be solved by surrounding them with leading and/or trailing space. Unidecode already does that for some symbols.
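A rough sketch of the padding idea, assuming a hypothetical one-entry replacement table. `TABLE` and `transliterate` are illustrative names only and do not reflect how Unidecode stores or applies its mappings:

```python
# Hypothetical replacement table for illustration only.
TABLE = {"\U0001F604": ":-D"}

def transliterate(text, pad=True):
    """Replace mapped characters, optionally padding each replacement with spaces."""
    out = []
    for ch in text:
        if ch in TABLE:
            rep = TABLE[ch]
            out.append(" " + rep + " " if pad else rep)
        else:
            out.append(ch)
    return "".join(out)

print(repr(transliterate("\U0001F604own", pad=False)))  # ':-Down'
print(repr(transliterate("\U0001F604own")))             # ' :-D own'
```

Padding removes the ":-Down" ambiguity, though it can leave stray leading or trailing spaces that callers may want to trim.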

@mvasilkov

Ah, I assumed it was targeting English the whole time, since it mentions the US keyboard layout in the readme, and also this: https://github.com/avian2/unidecode/blob/master/unidecode/x033.py

But fair enough I guess.

@avian2
Owner

avian2 commented Jun 19, 2018

I wouldn't like to target English any more than is implied by the fact that Unidecode transliterates into a character set with a US origin. The US keyboard layout is mentioned because that is the most common layout used to enter ASCII text. I used it as an illustration of what problem Unidecode tries to solve: imagine a person trying to enter non-English words into a computer that only accepts ASCII through an American keyboard.

I am not familiar with the U+33xx Unicode page you mention. Codepoint descriptions suggest these represent those specific English words (I'm guessing for use in Japanese text?).

To be honest, I see no perfect solution for emojis, and Unidecode is about compromise. I think English short-codes as discussed above would be a good start.

4 participants