Unicode and UnicodeString class

galkahana edited this page Feb 14, 2013 · 2 revisions

The library uses strings for various reasons, among which are:

  • File names
  • URL links
  • Text for writing text
  • Logging and tracing

In each of these cases (and any others I fail to mention) the input is either an std string object or a plain char pointer, and either one is expected to be encoded in UTF8. This is rather native for Unix and Mac OSX, but some other systems, Windows included, require wide character usage and UTF16 encoding. Wide chars could not serve as the common type, because wchar_t is 4 bytes on Mac and Unix and 2 bytes on Windows. Due to that difference I had to go with the UTF8 method.

Two points about that:

  1. Windows still requires the non-ANSI "_wfopen" to open files whose names have wide characters. I got this ifdefed in my file stream IO (actually in a separate file here referenced by both) on condition that the WIN32 pre-processor symbol exists. Everywhere else I'm using "fopen" for file opening. This bears on the type of file names you can provide: for Mac OSX it means you should pass POSIX paths. If you want something else, use a custom stream implementation. See Custom input and output on how to do that.

  2. To help you with conversions, in case you want such help, and for my own use, there's the UnicodeString class, where I'm implementing all the encoding conversions that I deemed important. Note that internally UnicodeString holds an unsigned long for each character, making it essentially UCS4/UTF32 encoded, or simply put, the unicode values themselves.
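The WIN32 switch from the first point can be sketched roughly as below. This is only an illustration under assumptions, not the library's file stream code: the helper names OpenUTF8Path and WidenUTF8 are hypothetical, and using MultiByteToWideChar for the UTF8 to UTF16 step is my choice for the sketch (the library may well do that conversion differently).

```cpp
#include <cstdio>
#include <string>

#if defined(WIN32) || defined(_WIN32)
#include <windows.h>

// Hypothetical helper: widen a UTF8 string to UTF16 for the Windows API.
static std::wstring WidenUTF8(const std::string& inText)
{
    if (inText.empty())
        return std::wstring();
    // First call asks for the required length, second call converts.
    int length = MultiByteToWideChar(CP_UTF8, 0, inText.c_str(),
                                     static_cast<int>(inText.size()), NULL, 0);
    std::wstring result(length, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, inText.c_str(),
                        static_cast<int>(inText.size()), &result[0], length);
    return result;
}
#endif

// Hypothetical helper: open a file whose path is given in UTF8.
// On Windows the path goes through _wfopen, elsewhere plain fopen is enough.
std::FILE* OpenUTF8Path(const std::string& inPath, const std::string& inMode)
{
#if defined(WIN32) || defined(_WIN32)
    return _wfopen(WidenUTF8(inPath).c_str(), WidenUTF8(inMode).c_str());
#else
    return std::fopen(inPath.c_str(), inMode.c_str());
#endif
}
```

Callers always pass UTF8 and never see the platform difference, which is the point of keeping the ifdef in one place.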

The rest of the discussion will relate to UnicodeString methods:

UTF8

Two methods relate to UTF8, and you can use them for conversions sake:

EStatusCode FromUTF8(const string& inString);
EStatusCodeAndString ToUTF8() const;

The FromUTF8 gets an std string object encoded in UTF8 and builds the string internal representation with the matching unicode values.

The ToUTF8 returns a UTF8 encoded string paired with a status. Check the status, and use the string only if it's OK. Not that it is supposed to fail if you got proper Unicode values in there.
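To make "the matching unicode values" a bit more concrete, here is a small self-contained sketch of what decoding UTF8 into a list of values involves. It is not the library's implementation. The names DecodeUTF8 and CodePointList are hypothetical, and for brevity the sketch does not reject overlong forms or encoded surrogates.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for an internal list of unicode values.
typedef std::list<unsigned long> CodePointList;

// Decodes UTF8 bytes into unicode values. Returns false on malformed input:
// an invalid lead byte, a truncated sequence, or a bad continuation byte.
bool DecodeUTF8(const std::string& inString, CodePointList& outCodePoints)
{
    outCodePoints.clear();
    std::string::size_type i = 0;
    while (i < inString.size())
    {
        unsigned char lead = static_cast<unsigned char>(inString[i++]);
        unsigned long value;
        int extraBytes;

        // The lead byte tells how many continuation bytes follow.
        if (lead < 0x80)                { value = lead;        extraBytes = 0; }
        else if ((lead & 0xE0) == 0xC0) { value = lead & 0x1F; extraBytes = 1; }
        else if ((lead & 0xF0) == 0xE0) { value = lead & 0x0F; extraBytes = 2; }
        else if ((lead & 0xF8) == 0xF0) { value = lead & 0x07; extraBytes = 3; }
        else return false;

        // Each continuation byte looks like 10xxxxxx and adds 6 bits.
        for (; extraBytes > 0; --extraBytes)
        {
            if (i >= inString.size())
                return false;
            unsigned char next = static_cast<unsigned char>(inString[i++]);
            if ((next & 0xC0) != 0x80)
                return false;
            value = (value << 6) | (next & 0x3F);
        }
        outCodePoints.push_back(value);
    }
    return true;
}
```

The boolean result plays the same role as the status code above: the output list is only meaningful when decoding succeeded.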

UTF16/UCS2

UTF16 is very interesting for Windows, where text comes as 2 byte wchar_t or simply unsigned shorts, both using a UTF16 or UCS2 sort of encoding.

You can use the unicode class to convert to and from unsigned shorts encoded as UTF16 (No BOM! byte ordering is implied by OS):

EStatusCode FromUTF16UShort(const unsigned short* inShorts, unsigned long inLength);

EStatusCodeAndUShortList ToUTF16UShort() const;

FromUTF16UShort converts a UTF16 encoded unsigned short input to the internal unicode representation. This can be your way out if you are using wstring on Windows (or any system that uses 2 bytes for wchar_t). Just do the relevant casting and pass to this method.

ToUTF16UShort will return a list of short values encoded as UTF16, which can be used to initialize a matching 2 byte wstring, or for whatever other usage you have in mind. Here too there's a status code, in case you played with the unicode values (and put some single surrogate values or something of that naughty sort).
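The reason a conversion to UTF16 can fail at all is easier to see in code. The following is a self-contained sketch of the standard UTF16 encoding rule, not the library's code. EncodeUTF16, CodePointList and UShortList are hypothetical names chosen for the example.

```cpp
#include <list>

// Hypothetical stand-ins for the list types involved.
typedef std::list<unsigned long> CodePointList;
typedef std::list<unsigned short> UShortList;

// Encodes unicode values as UTF16 units. Returns false if a value cannot be
// encoded: a lone surrogate (0xD800..0xDFFF) or anything above 0x10FFFF.
bool EncodeUTF16(const CodePointList& inCodePoints, UShortList& outShorts)
{
    outShorts.clear();
    for (CodePointList::const_iterator it = inCodePoints.begin();
         it != inCodePoints.end(); ++it)
    {
        unsigned long value = *it;
        if (value >= 0xD800 && value <= 0xDFFF)
            return false; // surrogate values are not valid characters
        if (value > 0x10FFFF)
            return false; // outside the unicode range

        if (value < 0x10000)
        {
            // Basic plane: one unit, same as UCS2.
            outShorts.push_back(static_cast<unsigned short>(value));
        }
        else
        {
            // Supplementary planes: split 20 bits into a surrogate pair.
            value -= 0x10000;
            outShorts.push_back(static_cast<unsigned short>(0xD800 + (value >> 10)));
            outShorts.push_back(static_cast<unsigned short>(0xDC00 + (value & 0x3FF)));
        }
    }
    return true;
}
```

Going the other way on a system with a 2 byte wchar_t, the "relevant casting" for FromUTF16UShort would presumably be a reinterpret_cast of the wstring's c_str() to const unsigned short*, passed together with the string length. That is only valid where sizeof(wchar_t) is 2.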

The other methods are for actual encoding to/from UTF16 byte strings. The data stays in single bytes, but is encoded as UTF16. Here, of course, the byte order matters, so either mention it with a BOM or use the method that matches it:

// convert from UTF16 string, requires BOM
EStatusCode FromUTF16(const string& inString);
EStatusCode FromUTF16(const unsigned char* inString, unsigned long inLength);

// convert from UTF16BE, do not include BOM
EStatusCode FromUTF16BE(const string& inString);
EStatusCode FromUTF16BE(const unsigned char* inString, unsigned long inLength);

// convert from UTF16LE do not include BOM
EStatusCode FromUTF16LE(const string& inString);
EStatusCode FromUTF16LE(const unsigned char* inString, unsigned long inLength);


// convert to UTF16 BE
EStatusCodeAndString ToUTF16BE(bool inPrependWithBom) const;
	
// convert to UTF16 LE
EStatusCodeAndString ToUTF16LE(bool inPrependWithBom) const;

The six "From" methods deal with byte input encoded in UTF16. Note that there's always an option to use either std strings or unsigned chars. Whatever gets you kickin'. The first pair - FromUTF16 - gets an input with a BOM, so it can determine the byte order and decode accordingly. The other two pairs - FromUTF16BE and FromUTF16LE - allow you to get input from UTF16 where you know the byte order. In this case no BOM is required (in fact...make sure not to pass one).

The last two methods - ToUTF16BE and ToUTF16LE - can be used in case you are looking for encoding to a byte string of UTF16. You can tell them whether to prepend a BOM or not.
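The difference between the BOM and no-BOM variants comes down to how bytes get paired into 16 bit units. Here is a self-contained sketch of that step, not the library's code. UnitsFromBytes, UnitsFromBytesWithBOM and UShortVector are hypothetical names.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the 16 bit units.
typedef std::vector<unsigned short> UShortVector;

// Pairs bytes into 16 bit units using an explicit byte order. No BOM is
// expected here. Returns false for an odd number of bytes.
bool UnitsFromBytes(const std::string& inBytes, bool inBigEndian,
                    UShortVector& outUnits)
{
    outUnits.clear();
    if (inBytes.size() % 2 != 0)
        return false;
    for (std::string::size_type i = 0; i < inBytes.size(); i += 2)
    {
        unsigned int first = static_cast<unsigned char>(inBytes[i]);
        unsigned int second = static_cast<unsigned char>(inBytes[i + 1]);
        unsigned int unit = inBigEndian ? ((first << 8) | second)
                                        : ((second << 8) | first);
        outUnits.push_back(static_cast<unsigned short>(unit));
    }
    return true;
}

// Reads the BOM to pick the byte order, then decodes the rest.
// Returns false if the input does not start with a BOM.
bool UnitsFromBytesWithBOM(const std::string& inBytes, UShortVector& outUnits)
{
    outUnits.clear();
    if (inBytes.size() < 2)
        return false;
    unsigned char b0 = static_cast<unsigned char>(inBytes[0]);
    unsigned char b1 = static_cast<unsigned char>(inBytes[1]);
    if (b0 == 0xFE && b1 == 0xFF) // FE FF means big endian
        return UnitsFromBytes(inBytes.substr(2), true, outUnits);
    if (b0 == 0xFF && b1 == 0xFE) // FF FE means little endian
        return UnitsFromBytes(inBytes.substr(2), false, outUnits);
    return false;
}
```

In this sketch, a BOM passed to the explicit byte order function simply comes out as a stray 0xFEFF (or 0xFFFE) unit at the start of the text, which is one way to see why the no-BOM methods above should not be given one.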

UCS4/UTF32

There isn't really any UCS4/UTF32 handling, so there's no direct support for the Mac OSX 4 byte wchar_t. HOWEVER, note that the internal representation is actually 4 byte, so you can access it directly and just set the values from the wchar_t string:

const ULongList& GetUnicodeList() const;
ULongList& GetUnicodeList();

The GetUnicodeList methods just give you direct access to the internal representation of unicode, and can be considered as the UCS4/UTF32 encoding, for all relevant matters.
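Setting the values from a 4 byte wchar_t string could look something like the sketch below. It assumes ULongList is a std::list of unsigned long (declared locally here so the example stands alone), and CopyWideToUnicodeList is a hypothetical helper, not a library method.

```cpp
#include <list>
#include <string>

// Assumption: the library's ULongList is a list of unsigned longs.
typedef std::list<unsigned long> ULongList;

// Copies each wide character as one unicode value. Only meaningful where
// wchar_t holds full unicode values (4 byte wchar_t, as on Mac OSX and Unix).
// With a 2 byte wchar_t, go through the UTF16 methods instead.
void CopyWideToUnicodeList(const std::wstring& inWide, ULongList& outList)
{
    outList.clear();
    for (std::wstring::size_type i = 0; i < inWide.size(); ++i)
        outList.push_back(static_cast<unsigned long>(inWide[i]));
}
```

With a UnicodeString at hand, the target list would be the one returned by the non-const GetUnicodeList().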