Saturday, 26 July 2014

Unicode and Encodings

Here is a summary of all things Unicode:

  • Unicode maps integers (code points) in the range 0 to 0x10FFFF to characters; a code point needs at most 21 bits, so it fits in a 32-bit (4-byte) integer
  • The first 128 code points (hex values 00 to 7F) are the same as ASCII
  • The next 128 code points (0x80 to 0xFF) are the same as ISO-8859-1
  • An encoding is a mapping from bytes to Unicode code points 
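
The points above can be checked directly in Python. This is only a small sketch using the built-in ord(), chr() and the standard codecs; the variable names are illustrative.

```python
# Code points are plain integers; ord() and chr() convert between
# a character and its code point.
assert ord("a") == 0x61
assert chr(0xE9) == "é"

# Code points 0x00-0x7F coincide with ASCII.
ascii_bytes = bytes(range(0x80))
assert ascii_bytes.decode("ascii") == "".join(chr(cp) for cp in range(0x80))

# Code points 0x00-0xFF coincide with ISO-8859-1 (Latin-1).
latin1_bytes = bytes(range(0x100))
assert latin1_bytes.decode("iso-8859-1") == "".join(chr(cp) for cp in range(0x100))

# An encoding maps between bytes and code points, so the same character
# can have different byte representations under different encodings.
e_acute = "é"
latin1 = e_acute.encode("iso-8859-1")  # one byte: 0xE9
utf8 = e_acute.encode("utf-8")         # two bytes: 0xC3 0xA9
print(latin1, utf8)
```
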
Character Reference and Code Tables
  • A plane is a contiguous group of 65,536 (= 2^16) code points
  • There are 17 planes, identified by the numbers 0 to 16
  • The Basic Multilingual Plane (BMP) is plane 0 (0000–FFFF)
  • Planes 1–16 are called “supplementary planes”
  • The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10, signifying the plane to which the values belong
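
Since each plane holds 0x10000 code points, the plane number is just the code point shifted right by 16 bits. A minimal sketch (plane_of is a hypothetical helper name, not a library function):

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a Unicode code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane spans 0x10000 code points, so the bits above the
    # low 16 identify the plane.
    return code_point >> 16

print(plane_of(ord("a")))   # BMP
print(plane_of(0x1F600))    # a supplementary-plane code point
print(plane_of(0x10FFFF))   # last code point of the last plane
```
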
UTF-16 Encoding
  • Encodes code points as one or two 16-bit code units
  • Code points in the BMP are encoded as a single 16-bit code unit that is numerically equal to the code point
  • Code points from the supplementary planes are encoded as pairs of 16-bit code units called surrogate pairs
UTF-32 Encoding
  • Uses exactly 32 bits per Unicode code point
  • The UTF-32 form of a character is a direct representation of its code point
  • Example: 00 00 00 61 is (big-endian) UTF-32 for code point U+0061, which is 'a'
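
To make the surrogate pair rule concrete, here is a rough sketch of the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The helper name to_surrogates is made up for illustration; the result is cross-checked against Python's own utf-16-be codec.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary-plane code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need a pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))

# Cross-check against the built-in codec.
encoded = chr(0x1F600).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")

# A BMP character is a single 16-bit unit equal to its code point.
assert "a".encode("utf-16-be") == b"\x00\x61"

# UTF-32 is the code point itself, padded to 4 bytes.
assert "a".encode("utf-32-be") == b"\x00\x00\x00\x61"
```
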
Byte Order Mark (BOM)
  • The BOM is the code point U+FEFF
  • If the decoder's byte order matches the encoder's, the decoder reads the BOM as 0xFEFF. An opposite-endian decoder reads it as U+FFFE, a noncharacter reserved for this purpose; that incorrect result is the hint to byte-swap the remaining values
  • In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream
  • The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF
  • The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8 
  • HTML Entity: &#229; (decimal) or &#xE5; (hex) (= å)
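
The BOM behaviour described above can be sketched with Python's codecs module. detect_utf16_byte_order is a hypothetical helper and deliberately simplistic: it assumes the data is UTF-16 and only looks at the first two bytes.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from a leading BOM, if any."""
    if data[:2] == codecs.BOM_UTF16_BE:   # FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:   # FF FE (U+FEFF seen byte-swapped)
        return "little"
    return "unknown"

be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(detect_utf16_byte_order(be), detect_utf16_byte_order(le))

# The generic "utf-16" codec reads the BOM, picks the byte order and drops it.
assert be.decode("utf-16") == "hi"
assert le.decode("utf-16") == "hi"

# The UTF-8 form of U+FEFF is EF BB BF; "utf-8-sig" strips it when decoding.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf" == codecs.BOM_UTF8
assert (codecs.BOM_UTF8 + b"hi").decode("utf-8-sig") == "hi"
```
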
URL Unicode Encoding
  • UTF-16: %uXXXX (a non-standard form), e.g. %u00e9 -> é
  • UTF-8: %XX[%XX][%XX][%XX], e.g. %c2%a9 -> ©, %e2%89%a0 -> ≠
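
Both URL forms can be tried out in Python. The UTF-8 form is handled by urllib.parse; the standard library has no decoder for the %uXXXX form as far as I know, so unquote_u below is a hypothetical helper that handles single 16-bit units only (it does not recombine surrogate pairs).

```python
import re
from urllib.parse import quote, unquote

# UTF-8 percent-encoding: one %XX per byte of the UTF-8 form.
assert unquote("%c2%a9") == "©"
assert unquote("%e2%89%a0") == "≠"
assert quote("©") == "%C2%A9"
assert quote("≠") == "%E2%89%A0"

def unquote_u(text: str) -> str:
    """Decode %uXXXX escapes, one 16-bit code unit at a time."""
    return re.sub(
        r"%u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

print(unquote_u("caf%u00e9"))
```
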
Compiled from:
